GCAT - Link Prediction in Knowledge Graphs
SUPERVISOR
MSc. Le Ngoc Thanh
Department of Computer Science
DECLARATION
We hereby declare that this thesis is our own research work. The data and research results presented in it are truthful and accurate, and have not been duplicated from any other projects.
ACKNOWLEDGEMENTS
We would like to express our sincere gratitude to MSc Le Ngoc Thanh for his
dedicated guidance, for sharing his knowledge and experience, and for providing us
with valuable solutions throughout the completion of this graduation thesis.
We would also like to extend our thanks to the faculty members of the Faculty
of Information Technology, University of Science - Vietnam National University, Ho
Chi Minh City, who have imparted invaluable knowledge to us during our academic
journey.
Our gratitude also goes to the scientists and researchers whose work we have cited
and built upon to complete our thesis.
Lastly, we would like to thank our families, friends, and others who have always
supported and encouraged us during the process of working on this thesis.
Once again, our sincere thanks!
Table of Contents

Declaration
Acknowledgements
List of Figures
List of Tables
List of Algorithms
List of Symbols and Abbreviations
Glossary of Terms
Abstract
Chapter 1  INTRODUCTION
    4.1.2  Graph Embedding Techniques
4.2  Multi-head Attention Mechanism
    4.2.1  Attention Mechanism
    4.2.2  Multi-Head Attention
4.3  Graph Attention Network
4.4  KBGAT Model
    4.4.1  Embedding Initialization
    4.4.2  Encoder Model
    4.4.3  ConvKB Prediction Model
Chapter 5  EXPERIMENTS
5.1  Training Datasets
    5.1.1  FB15k Dataset
    5.1.2  FB15k-237 Dataset
    5.1.3  WN18 Dataset
    5.1.4  WN18RR Dataset
5.2  Evaluation Metrics
5.3  Training Methodology
    5.3.1  Training with the KBGAT Model
5.4  Experimental Results
Chapter 6  CONCLUSION
REFERENCES
LIST OF FIGURES

LIST OF TABLES

LIST OF ALGORITHMS
LIST OF SYMBOLS AND ABBREVIATIONS
List of Symbols
Symbol Description
G Graph
Gmono Homogeneous graph
Ghete Heterogeneous graph
Gknow Knowledge graph
V, E Set of vertices, set of edges
e, ei Entity, the i-th entity
r, rk Relation, the k-th relation
t_ijk, t^k_ij An edge/triple
e⃗, r⃗ Entity embedding, relation embedding
⟨h, r, t⟩ A triple of head entity, relation, tail entity
T v, T e Set of vertex types, edge types
Ne , Nr Number of entities, number of relations
Nhead Number of self-attention heads
R Set of real numbers
E, R Entity embedding matrix, relation embedding matrix
S Training dataset
∗ Convolution operation
σ Non-linear activation function
W Weight matrix
∥_{k=1}^{K} Concatenation from layer 1 to K
|| Concatenation
(·)^T Transpose
∥W∥²_2 Squared L2 norm (L2 regularization term)
∨, ∧ Disjunction (OR), conjunction (AND)
⊕ Binary operation
∩ Intersection
¬ Negation
⋀_{i=1}^{n} Chain conjunction
γ Margin
µ Learning rate
ω Convolutional filter
Ω Number of convolutional filters
GLOSSARY OF TERMS
Abstract
A knowledge graph is a structure used to represent real-world information, which has been successfully researched and developed by Google for its search engine [15]. The exploitation of knowledge graphs not only involves querying and analysis, but also completing missing information and predicting links based on the available data within the graph. Therefore, in this report, we present an overview of knowledge graphs and two methods for link prediction in graphs: rule-based methods and deep learning-based methods.
For the rule-based method, we rely on the AnyBURL model and propose two additional strategies for inserting new knowledge into the graph.
For the deep learning-based method, we review the attention mechanism in natural language processing, which is then applied to knowledge graphs, and present a fully detailed improved model called KBGAT. By stacking attention layers, nodes can attend to the features of their neighbors without incurring additional computational cost or relying on prior knowledge of the graph structure.
Our two models achieved significantly better results compared to other link prediction methods applied on four standard datasets.
Chapter 1. INTRODUCTION
Nowadays, graphs have been applied in all aspects of life. Social network graphs (e.g.,
Facebook [45]) illustrate how individuals are connected to each other, the places we
visit, and the information we interact with. Graphs are also used as core structures
in video recommendation systems (e.g., YouTube [3]), flight networks, GPS naviga-
tion systems, scientific computations, and even brain connectivity analysis. Google’s
Knowledge Graph [15], introduced in 2012 [22], is a notable example of how information
can be structured and utilized in knowledge graphs.
Effectively exploiting knowledge graphs provides users with deeper insight into the
underlying data, which can benefit many real-world applications. However, in prac-
tice, new knowledge is continuously generated, and the acquired information is often
incomplete or missing. This leads to the problem of knowledge graph completion or
link prediction in knowledge graphs.
Most current approaches aim to predict a new edge connecting two existing nodes.
Such methods help make the graph more complete—i.e., denser—by introducing ad-
ditional connecting edges. However, these approaches primarily address the problem
of completion rather than the challenge of integrating new knowledge into the graph,
which remains an open question. Currently, research in knowledge graph completion
follows two main directions: one is optimizing an objective function to make predic-
tions with minimal error, as in RuDiK [35], AMIE [14], and RuleN [29], which are
typically used in vertex or edge classification applications. The other approach gener-
ates a ranked list of k candidate triples, where the score reflects decreasing confidence,
as seen in studies such as TransE [5] and ConvKB [48], which are commonly used in
recommendation systems. Our approach follows this second direction of producing a
candidate list.
Within these approaches, there are two main methodologies: rule-based systems
such as AnyBURL [27], and embedding-based methods such as ConvE [11], TransE
[5], and ComplEx [44]. With the goal of gaining a systematic understanding of these
methods, we chose to explore both directions in this thesis. For the rule-based ap-
proach, we selected AnyBURL [27], and for the graph embedding-based method, we
chose KBAT [32], which employs attention mechanisms.
Our contribution in the AnyBURL method includes a Python implementation 1 ,
along with two proposed strategies for adding new knowledge to the graph, which
we term offline-to-online and online-to-online. The offline-to-online strategy extends
AnyBURL by generating rules when a batch (set) of new knowledge is added. The
online-to-online strategy generates rules immediately when a single new piece of knowl-
edge (edge) is added.
For the embedding-based method, we present a review of attention mechanisms [46],
their application in knowledge graphs via Graph Attention Networks (GATs) [47], and
the KBAT model [32].
Our contribution in the deep learning approach includes a publicly available im-
plementation and training process on GitHub 2 , with both training code and model
results openly provided.
1 https://github.com/MinhTamPhan/mythesis
2 https://github.com/hmthanh/GCAT
Chapter 2. RELATED WORK
In this section, we present the basic definitions of knowledge graphs in order to under-
stand the task of link prediction in knowledge graphs, as well as other related research
directions.
Here, T^v and T^e are the sets of vertex types and edge types, respectively. Each vertex v_i ∈ V belongs to a specific type, i.e., f_v(v_i) ∈ T^v. Similarly, for each edge e_ij ∈ E, f_e(e_ij) ∈ T^e.
Definition 2 (Homogeneous Graph) A homogeneous graph G_homo = (V, E) is a graph where |T^v| = |T^e| = 1. All vertices in G belong to a single type, and all edges belong to a single type.
Example: in Figure 4.7, there are two triples: ⟨Tom Cruise, born in, New York⟩
and ⟨New York, state of, U.S⟩. Note that entities and relations in a knowledge
graph often belong to different types. Therefore, a knowledge graph can be viewed
as a specific case of a heterogeneous graph.
link (edge) in the graph, the task is to infer whether an edge with a specific label ex-
ists between two given nodes. Many methods have been proposed to learn rules from
graphs, such as in RuDiK [35], AMIE [14], and RuleN [29].
As mentioned earlier, there are two main approaches to this problem: one is op-
timizing an objective function to find a small set of rules that cover the majority of
correct examples with minimal error, as explored in RuDiK [35]. The other approach,
which we adopt in this thesis, aims to explore all possible rules and then generate a
top-k ranking of candidate triples, each associated with a confidence score measured
on the training set.
Our rule-based method is largely based on the Anytime Bottom-Up Rule Learning
for Knowledge Graph Completion method [28], hereafter referred to as AnyBURL.
As its name suggests, this method primarily focuses on completing the graph by filling
in missing parts. A key limitation of this model is that when a new edge or fact is
added to the graph, the entire model must be retrained. We address this issue using
two strategies: the offline-to-online strategy, which retrains a portion of the graph
only after a batch of new edges is added; and the online-to-online strategy, which
immediately retrains the affected portion of the graph whenever a new edge is added.
In the deep learning branch of approaches, many successful techniques from image
processing and natural language processing have been applied to knowledge graphs,
such as Convolutional Neural Networks (CNNs [24]), Recurrent Neural Networks (RNNs
[20]), and more recently, Transformers [50] and Capsule Neural Networks (CapsNets
[39]). In addition, other techniques such as random walks and hierarchical structure-
based models have also been explored. The common advantage of these deep learning
methods on knowledge graphs is their ability to automatically extract features and
generalize complex graph structures based on large amounts of training data. How-
ever, some methods focus mainly on grid-like structures and fail to preserve the spatial
characteristics of knowledge graphs.
The attention mechanism, particularly the multi-head attention layer, has been
applied to graphs through the Graph Attention Network (GAT [47]) model, which
aggregates information about an entity based on attention weights from its neighboring
entities. However, GAT lacks integration of relation embeddings and the embeddings of
an entity’s neighbors—components that are crucial for capturing the role of each entity.
This limitation has been addressed in the work Learning Attention-based Embeddings
for Relation Prediction in Knowledge Graphs (KBAT [32]), which we adopt as the
foundation for our study.
The attention mechanism is currently one of the most effective (state-of-the-art) deep learning structures, as it has been shown to be able to express any convolution operation [9]. Moreover, it serves as a core component in leading models for natural language
processing, such as Megatron-LM [40], and image segmentation, such as HRNet-OCR
(Hierarchical Multi-Scale Attention [42]). Some recent works [10] have proposed inter-
esting improvements based on the attention mechanism. However, these advancements
have not yet been applied to knowledge graphs, which motivates us to adopt this family
of methods to integrate the latest innovations into knowledge graph modeling.
categorized and summarized in the survey [22], including: Knowledge Representation
Learning, Knowledge Acquisition, Temporal Knowledge Graphs, and Knowledge-aware
Applications. All research categories are illustrated in Figure 2.2.
Knowledge Representation Learning
Knowledge representation learning is an essential research topic in knowledge graphs
that enables a wide range of real-world applications. It is categorized into four sub-
groups:
Scoring Function studies how to measure the validity of a triple in practice. These
scoring functions may be distance-based or similarity-based.
Encoding Models investigate how to represent and learn interactions among rela-
tions. This is currently the main research direction, including linear or non-linear
models, matrix factorization, or neural network-based approaches.
Knowledge Acquisition
Knowledge acquisition focuses on how to extract or obtain knowledge based on
knowledge graphs, including knowledge graph completion, relation extraction, and en-
tity discovery. Relation extraction and entity discovery aim to extract new knowledge
(relations or entities) into the graph from text. Knowledge graph completion refers
to expanding an existing graph by inferring missing links. Research directions include
embedding-based ranking, relation path reasoning, rule-based reasoning, and hyper-
relational learning.
Entity discovery tasks include entity recognition, disambiguation, typing, and rank-
ing. Relation extraction models often employ attention mechanisms, graph convolu-
tional networks (GCNs), adversarial training, reinforcement learning (RL), deep learn-
ing, and transfer learning, which is the foundation of the method proposed in our
work.
In addition, other major research directions in knowledge graphs include tempo-
ral knowledge graphs and knowledge-aware applications. Temporal knowledge
graphs incorporate temporal information into the graph to learn temporal representa-
tions. Knowledge-aware applications include natural language understanding, question
answering, recommendation systems, and many other real-world tasks where integrat-
ing knowledge improves representation learning.
Chapter 3. RULE-BASED METHOD
In this chapter, we describe how the problem is reformulated using the rule-based
approach AnyBURL, including the rule (path) sampling algorithm and the rule gen-
eralization algorithm used to store learned knowledge in the model. We also present
our improvements to the training process when new knowledge (edges) is added to the
graph.
3.2 Definition of Logical Graph Language
Unlike general definitions of knowledge graphs commonly used in graph embedding
methods, our rule-based approach treats the graph as a formal language. Below are
the formal language definitions of the knowledge graph.
A knowledge graph Gknow is defined over a vocabulary ⟨C, R⟩, where C is the set of
constants and R is the set of binary predicates. Then, Gknow = {r(a, b) | r ∈ R, a, b ∈
C} is the set of ground atoms or facts. A binary predicate is referred to as a relation,
and a constant (or referenced constant) is referred to as an entity, corresponding to a
data entry in the training set. In what follows, we use lowercase letters for constants
and uppercase letters for variables. This is because we do not learn arbitrary Horn
rules; instead, we focus only on those rule types that can be generalized as discussed
below.
We define a rule as h(c_0, c_n) ← b_1(c_0, c_1), . . . , b_n(c_{n−1}, c_n), which is a path of ground atoms of length n. Here, h(. . .) is referred to as the head atom, and b_1(c_0, c_1), . . . , b_n(c_{n−1}, c_n) are referred to as the body atoms. We distinguish the following three types of rules:
- Binary rules (B): rules in which the head atom contains two variables.
- Unary rules ending in a dangling node (U_d): rules where the head atom contains only one variable, and the rule ends in a body atom that contains only variables (no constants).
- Unary rules ending in a constant (U_c): rules where the head atom also contains only one variable, but the rule ends with an atom that may link to an arbitrary constant. If that constant matches the constant in the head atom, the rule forms a cyclic path.
B :  h(A_0, A_n) ← ⋀_{i=1}^{n} b_i(A_{i−1}, A_i)

U_d : h(A_0, c) ← ⋀_{i=1}^{n} b_i(A_{i−1}, A_i)        (3.1)

U_c : h(A_0, c) ← ⋀_{i=1}^{n−1} b_i(A_{i−1}, A_i) ∧ b_n(A_{n−1}, c′)
We refer to rules of these types as path rules because the body atoms (the part
after the ← symbol) form a path. Note that this also includes variants of rules where
the variables are reversed within the atoms. Given a knowledge graph Gknow , a path
of length n is a sequence of n triples p_i(c_i, c_{i+1}) where either p_i(c_i, c_{i+1}) ∈ Gknow or p_i(c_{i+1}, c_i) ∈ Gknow, with 0 ≤ i < n. The abstract rule patterns presented above are considered to have length n, as their body atoms can be instantiated into a path of length n − 1. For example, in Figure 3.1, when sampling paths of length 3, we can obtain two rules: the rule marked in green and the rule marked in red.
In addition, rules of type B and Uc are also referred to as closed-path rules. These
are utilized by the AMIE model, described in [13, 14]. Rule Ud is considered an open
rule or an acyclic path rule, since An is a variable that appears only once. For example:
states that a person X is female if they are married to person A and person A is male.
In rule (3), there is no cycle formed in the graph, unlike in rule (2), where node Y is
repeated both in the head atom and as the final node in the body atoms.
Rule (4) is a Ud -type rule, which states that a person X is an actor if they acted
in a movie A.
All rules under consideration are filtered based on a score called the rule’s confi-
dence, which is computed on the training dataset. This confidence score is defined as
the number of body atom paths that lead to the head atom, divided by the total number
of paths that contain only the body atoms.
For example, consider the following rule: gen(X, female) ← married(X, A), gen(A, male). We first count all entity pairs that satisfy the relations married(X, A), gen(A, male); this is the number of paths containing the body atoms. Then we count how many of those pairs also satisfy the inferred relation gen(X, female); this is the number of body atom paths that lead to the head atom. The confidence score of the rule is the ratio of the latter to the former.
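To make this computation concrete, the following minimal Python sketch (with a toy triple store; the set-based representation and helper names are illustrative assumptions, not our actual implementation) counts the body groundings and those that also satisfy the head:

# Minimal sketch: confidence of the rule
#   gen(X, female) <- married(X, A), gen(A, male)
# over a toy knowledge graph stored as a set of (head, relation, tail) triples.
def rule_confidence(triples):
    married = [(h, t) for (h, r, t) in triples if r == "married"]
    gender = {(h, t) for (h, r, t) in triples if r == "gen"}
    body_groundings = 0   # pairs satisfying married(X, A), gen(A, male)
    head_satisfied = 0    # of those, how many also satisfy gen(X, female)
    for x, a in married:
        if (a, "male") in gender:
            body_groundings += 1
            if (x, "female") in gender:
                head_satisfied += 1
    return head_satisfied / body_groundings if body_groundings else 0.0

toy_graph = {
    ("anna", "married", "bob"), ("bob", "gen", "male"), ("anna", "gen", "female"),
    ("carl", "married", "dave"), ("dave", "gen", "male"),   # head not satisfied
}
print(rule_confidence(toy_graph))   # 1 of 2 body groundings -> 0.5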
3.3.1 AnyBURL
1: procedure AnyBURL(Gknow, s, SAT, Q, ts)
2:     n = 2
3:     R = ∅
4:     loop
5:         Rs = ∅
6:         start = currentTime()
7:         repeat
8:             p = samplePath(Gknow, n)
9:             Rp = generateRules(p)
10:            for r ∈ Rp do
11:                score(r, s)
12:                if Q(r) then
13:                    Rs = Rs ∪ {r}
14:        until currentTime() > start + ts
15:        Rs′ = Rs ∩ R
16:        if |Rs′| / |Rs| > SAT then
17:            n = n + 1
18:        R = Rs ∪ R
19:    return R
The input of the algorithm consists of Gknow, s, SAT, Q, and ts. The output is the set R of learned rules. Here, Gknow is a knowledge graph derived from the training dataset. s is a parameter indicating the sample size used during each sampling iteration on the training data for confidence computation. SAT denotes the saturation threshold: the saturation of an iteration is the fraction of rules sampled in that iteration that have already been learned, relative to the total number of rules sampled in the iteration. If this value is below the threshold, we consider that there is still potential to discover rules of length n; otherwise, we increase the rule length and continue the rule mining process. Q is a quality criterion used to determine whether a newly generated rule should be added to the result set. ts indicates the time allocated to each sampling iteration, and thus controls the total learning time of the algorithm.
We start with n = 2, which corresponds to rules of path length 2, since a valid path rule requires at least one atom in the head and one in the body. In the rule sampling step (samplePath), we simply select a random node in the graph, traverse all possible paths of length n starting from that node, and then randomly select one of the traversed paths.
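A minimal Python sketch of this sampling step might look as follows (the adjacency-list representation and function name are illustrative assumptions, not the actual implementation):

import random

def sample_path(adj, n):
    # Sample one path of exactly n edges from a graph given as
    # adj: dict mapping entity -> list of (relation, neighbor) pairs.
    start = random.choice(list(adj.keys()))
    frontier = [(start, [])]
    for _ in range(n):
        next_frontier = []
        for node, path in frontier:
            for relation, neighbor in adj.get(node, []):
                next_frontier.append((neighbor, path + [(node, relation, neighbor)]))
        frontier = next_frontier
    if not frontier:
        return None          # the sampled start node has no path of length n
    _, chosen = random.choice(frontier)
    return chosen            # a list of n triples (head, relation, tail)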
1: procedure generateRules(p)
2:     generalizations = ∅
3:     is_binary_rule = random.choice([true, false])
4:     if is_binary_rule then
5:         replace_all_head_by_variables(p)
6:         replace_all_tail_by_variables(p)
7:         add(generalizations, p)
8:     else
9:         replace_all_head_by_variables(p)
10:        add(generalizations, p)
11:        replace_all_tail_by_variables(p)
12:        add(generalizations, p)
13:    return generalizations
In this algorithm, if the rule to be learned is a binary rule, we replace the constants in both the head and the tail of the path sampled in the previous step with variables. Otherwise, we replace either the head or the tail and add each resulting rule to the return set. We then sample a set of rules from the training set and compute their confidence
scores as described in Subsection 3.3.2. To reduce computational cost, we choose to
sample from the training set for this calculation.
When making predictions for rule candidates, we recompute confidence by incorporating an estimated number of incorrect rules not observed during sampling. For our model, after experimenting with values of this parameter in the range [5, 10], we selected the value that yielded the best results.
3.4 Extended AnyBURL Algorithm
This algorithm is our proposed extension to avoid retraining the entire model when a
new set of knowledge is added to the graph. When new knowledge is added, we first
check whether it is connected to the existing knowledge in the graph (i.e., connectivity).
If it is, we perform the ⊕ operation by combining all elements in batch_edge with the connected components in the graph, up to a path length of 5. If there is no connectivity, we use all elements in batch_edge and repeat the steps of the Anytime Bottom-up Rule Learning algorithm.
Algorithm 4 EdgeAnyBURL
1: procedure EdgeAnyBURL(Gknow , s, sat, Q, ts, edge)
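Since the remaining steps of Algorithm 4 mirror the base algorithm, the following Python sketch only illustrates the intended flow of this batch-update strategy; the helper names (is_connected, neighborhood, anyburl_learn) are hypothetical placeholders rather than our actual code:

def edge_anyburl(graph, batch_edges, anyburl_learn, is_connected, max_path_len=5):
    # Sketch: retrain only the portion of the graph affected by the new edges.
    graph.update(batch_edges)   # insert the new knowledge into the graph
    if any(is_connected(graph, h, t) for (h, r, t) in batch_edges):
        # The new edges touch existing knowledge: learn rules only on the
        # neighborhood reachable from them within max_path_len hops.
        affected = {e for (h, r, t) in batch_edges for e in (h, t)}
        subgraph = graph.neighborhood(affected, radius=max_path_len)
        return anyburl_learn(subgraph)
    # Otherwise the new edges form an isolated component: learn rules on it alone.
    return anyburl_learn(batch_edges)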
Chapter 4. DEEP LEARNING-BASED METHODS
In this chapter, we present Knowledge Graphs and describe the task of Graph Embed-
ding, providing an overview of current graph embedding techniques. We will revisit the
attention mechanism and explain how it is applied to knowledge graphs through the
Graph Attention Network (GAT) model [47]. Additionally, we present an improved
method based on the graph attention model - KBGAT [32] - which incorporates relation
information and neighboring relations.
e⃗_Trump = [1.9_height, 0_area, 1_{wife is Melania}, 0_{wife is Taylor}].
For features that cannot be measured or have no value (e.g., .area), we assign
0. For categorical features without magnitude (e.g., .wife), we represent them using
probabilities of unit features (e.g., .wife is Melania, .wife is Taylor). Therefore,
any object in the real world can be embedded as a vector in an interpretable way.
To understand graph embedding techniques, we begin with several fundamental
definitions:
Definition 5 (First-Order Proximity) Two vertices are more similar if they are connected by an edge with a higher weight. Thus, the first-order proximity between v_i and v_j is denoted as s_ij^(1) = A_i,j. Let s_i^(1) = [s_i1^(1), s_i2^(1), . . . , s_i|V|^(1)] represent the first-order proximities between v_i and other vertices.
Using the graph in Figure 2.1 as an example, the first-order proximity between v_1 and v_2 is the weight of edge e_12, denoted as s_12^(1) = 1.2. The vector s_1^(1) records the edge weights connecting v_1 to all other vertices in the graph, i.e.,

s_1^(1) = [0, 1.2, 1.5, 0, 0, 0, 0, 0, 0].
Definition 6 (Second-Order Proximity) The second-order proximity s_ij^(2) between vertices v_i and v_j is defined as the similarity between v_i's first-order neighborhood vector s_i^(1) and v_j's vector s_j^(1).

For example, in Figure 2.1, the second-order proximity s_12^(2) is the similarity between s_1^(1) and s_2^(1). As introduced above:

s_1^(1) = [0, 1.2, 1.5, 0, 0, 0, 0, 0, 0],   s_2^(1) = [1.2, 0, 0.8, 0, 0, 0, 0, 0, 0].

Higher-order proximities can be defined similarly. For example, the k-th order proximity between v_i and v_j is the similarity between s_i^(k−1) and s_j^(k−1).
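As a concrete illustration of Definitions 5 and 6, the short NumPy sketch below computes first-order proximities directly from the adjacency matrix and second-order proximities as the cosine similarity between the corresponding rows (cosine similarity is one common choice of row-similarity measure; the adjacency matrix here is a made-up toy example):

import numpy as np

# Adjacency matrix of a small weighted graph (row i is the vector s_i^(1)).
A = np.array([
    [0.0, 1.2, 1.5, 0.0],
    [1.2, 0.0, 0.8, 0.0],
    [1.5, 0.8, 0.0, 2.0],
    [0.0, 0.0, 2.0, 0.0],
])

def first_order(i, j):
    # s_ij^(1) is simply the edge weight A[i, j].
    return A[i, j]

def second_order(i, j):
    # s_ij^(2): similarity between the first-order vectors s_i^(1) and s_j^(1).
    si, sj = A[i], A[j]
    return float(si @ sj / (np.linalg.norm(si) * np.linalg.norm(sj)))

print(first_order(0, 1))    # 1.2
print(second_order(0, 1))   # cosine similarity of rows 0 and 1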
Graph embedding is the process of transforming graph features into vectors or sets of
low-dimensional vectors. The more effective the embedding, the higher the accuracy in
subsequent graph mining and analysis tasks. The biggest challenge in graph embedding
depends on the problem setting, which includes both the embedding input and output,
as illustrated in Figure 4.1.
Graph Embedding
Based on the embedding input, we categorize the surveyed methods in [7] as follows:
Homogeneous Graph, Heterogeneous Graph, Graph with Auxiliary Information, and
Graph Constructed from Non-relational Data.
Different types of embedding inputs preserve different information in the embedding
space and therefore pose different challenges for the graph embedding problem. For
example, when embedding a graph with only structural information, the connections
between nodes are the primary target to preserve. However, for graphs with node labels
or entity attribute information, auxiliary information provides additional context for
the graph and can therefore also be considered during the embedding process. Unlike
embedding input, which is fixed and provided by the dataset, the embedding output
is task-specific.
For instance, the most common embedding output is node embedding, which
represents each node as a vector that reflects similarity between nodes. Node embed-
dings are beneficial for node-related tasks such as node classification, node clustering,
etc.
However, in some cases, the tasks may involve more fine-grained graph components
such as node pairs, subgraphs, or the entire graph. Therefore, the first challenge of
embedding is to determine the appropriate type of embedding output for the application
of interest. Four types of embedding outputs are illustrated in Figure 4.1, including:
Node Embedding (4.2), Edge Embedding (4.3), Hybrid Embedding (4.4), and
Whole-Graph Embedding (4.5). Different output granularities have distinct criteria
and present different challenges. For example, a good node embedding retains similarity
with its neighbors in the embedding space. Conversely, a good whole-graph embedding
represents the entire graph as a vector that preserves graph-level similarity.
Figure 4.2: Node embedding with each vector representing node features
Node embedding represents each node as a low-dimensional vector. Nodes that are
close in the graph have similar vector representations. The difference among graph em-
bedding methods lies in how they define the closeness between two nodes. First-order
proximity (Definition 5) and second-order proximity (Definition 6) are two commonly
used metrics to measure pairwise node similarity. Higher-order proximity has also been
explored to some extent. For example, capturing k-step (k = 1, 2, 3, ···) neighborhood
relationships during embedding is discussed in the study by Cao, Shaosheng [8].
Edge Embedding
Figure 4.3: Edge embedding with each vector representing edge features
Hybrid Embedding
Hybrid embedding represents a combination of different types of graph components, e.g., node + edge (i.e., substructure), or node + parts. Substructure or part embeddings
can also be derived by aggregating individual node and edge embeddings. However,
such indirect approaches are not optimized to capture the structure of the graph.
Moreover, node and part embeddings can reinforce each other. Node embeddings
improve by learning from high-order neighborhood attention, while part embeddings
become more accurate due to the collective behavior of their constituent nodes.
Whole-Graph Embedding
Whole-graph embedding is typically used for small graphs such as proteins, molecules,
etc. In this case, an entire graph is represented as a vector, and similar graphs are em-
bedded close to each other. Whole-graph embedding is useful for graph classification
tasks by providing a simple and effective solution for computing graph similarity. To
balance embedding time (efficiency) and information retention (expressiveness), hi-
erarchical graph embedding [31] introduces a hierarchical embedding framework.
It argues that accurate understanding of global graph information requires processing
substructures at multiple scales. A graph pyramid is formed where each level is a
coarsened graph at a different scale. The graph is embedded at all levels and then
concatenated into a single vector. Whole-graph embedding requires collecting features
from the entire graph, thus is generally more time-consuming than other settings.
Graph Embedding Techniques
[Figure: taxonomy of graph embedding techniques: deep learning (with and without random walks), matrix factorization (Laplacian eigenmaps), edge reconstruction (maximizing edge reconstruction probability, minimizing margin-based ranking loss), graph kernels (graphlet-based, random walk-based), and generative models (latent space-based embedding).]

The main differences among embedding techniques lie in how they define which
intrinsic graph properties should be preserved. Since the primary focus of our work is
on graph embedding methods based on deep learning, we provide only brief overviews
of the other method categories.
Deep Learning
In this section, we present in detail the research directions of deep learning tech-
niques, including: using random walks and not using random walks. Deep learning
techniques are widely used in graph embedding due to their speed and efficiency in au-
tomatically capturing features. Among these deep learning-based methods, three types
of input-based graph settings (excluding graphs constructed from non-relational data)
and all four output types (as shown in Figure 4.1) can adopt deep learning approaches.
Deep learning techniques with random walks
In this category, the second-order proximity (Definition 6) in the graph is preserved
in the embedding space by maximizing the probability of observing a vertex’s neigh-
borhood conditioned on its embedding vector. The graph is represented as a set of
samples obtained via random walks, and then deep learning methods are applied to
ensure the structural properties (i.e., path-based information) are preserved. Repre-
sentative methods in this group include: DeepWalk [36], LINE [41], Node2Vec [17],
Anonymous Walk [21], NetGAN [4], etc.
Deep learning techniques without random walks
This approach applies multi-layer learning structures effectively and efficiently to
transform the graph into a lower-dimensional space. It operates over the entire graph.
Several popular methods have been surveyed and presented in [38], as follows:
Convolutional neural networks: this group uses multiple convolutional layers; each layer performs convolution on the input data using low-dimensional filters. The result is a feature map, which then passes through a fully connected layer to compute the probability values. For example, in ConvE [11], each entity and relation is represented as a low-dimensional (d-dimensional) vector. For each triple, the model concatenates and reshapes the head h and relation r embeddings into a single input [h, r] of shape d_m × d_n. This is then passed through a convolutional layer with a filter ω of size m × n, followed by a fully connected layer with weights W. The final result is combined with the tail embedding t using a dot product. This architecture can be considered a multi-class classification model.
Another popular model is ConvKB [33], which is similar to ConvE but concatenates the three embeddings h, r, and t into a matrix [h, r, t] of shape d × 3. It is then passed through a convolutional layer with T filters of size 1 × 3, resulting in a feature map of size d × T. This is further passed through a fully connected layer with weights W. This architecture can be considered a binary classification model.
Recurrent neural networks: these models apply one or more recurrent layers to analyze the entire path (a
sequence of events/triples) sampled from the training set, instead of treating
each event independently. For example, RSN [18] notes that traditional RNNs
are unsuitable for graphs because each step only takes the relation information
without considering the entity embedding from the previous step. Therefore, it
fails to clearly model the transitions among entity-relation paths. To address
this, they propose RSN (Recurrent Skipping Networks [18]): at each step, if the
input is a relation, a hidden state is updated to reuse the entity embedding. The
output is then dot-multiplied with the target embedding vector.
Capsule networks group neurons into "capsules", where each capsule encodes specific features of the input, such as representing a particular part of an image.
One advantage of capsule networks is their ability to capture spatial relationships
that are lost in conventional convolution. Each capsule produces feature vectors.
For instance, CapsE [48]: each entity and relation is embedded into vectors,
similar to ConvKB. It concatenates the embeddings h, r, and t into a matrix
of shape d × 3, then applies E convolutional filters of size 1 × 3, resulting in
a d × E matrix. Each i-th row encodes distinct features of h[i], r[i], and t[i].
This matrix is then fed into a capsule layer, where each capsule (4.1.2) processes
a column, thus receiving feature-specific information from the input triple. A
second capsule layer is used to produce the final output.
Attention-based models: this category uses the attention mechanism [46], which has achieved notable
success in NLP tasks. For each embedding vector, information from neighboring
entities is aggregated using attention weights. These are then combined and
passed through a fully connected layer with learnable weights to obtain the final
embeddings. For example, GAT [47] applies multi-head attention to each training
triple to generate an embedding vector. This embedding is then transformed via a
weight matrix to produce a higher-dimensional vector that aggregates information
from neighboring nodes in the original triple. An improved version, KBGAT
[32], incorporates the relation embedding into the attention mechanism. These
methods will be discussed in detail in the subsequent sections.
Other methods
There are also other approaches, such as autoencoder-based techniques like Struc-
tural Deep Network Embedding (SDNE) [49].
Matrix Factorization
Matrix factorization-based graph embedding represents the structural characteris-
tics of a graph (e.g., similarity or proximity between vertex pairs) in the form of a
matrix and then factorizes this matrix to obtain vertex embeddings. The input for
this category of methods is typically high-dimensional non-relational features, and the
output is a set of vertex embeddings. There are two matrix factorization-based graph
embedding methods: Graph Laplacian Eigenmaps and Node Proximity Matrix Factor-
ization.
Graph Laplacian Eigenmaps
This approach preserves graph properties by analyzing similar vertex pairs and
heavily penalizes embeddings that place highly similar nodes far apart in the
embedding space.
Edge Reconstruction
The edge reconstruction method builds edges based on the vertex embeddings so
that the reconstructed graph is as similar as possible to the input graph. This method
either maximizes the edge reconstruction probability or minimizes edge reconstruction
loss. Additionally, the loss can be distance-based or margin-based ranking loss.
In this approach, the edges in the input graph represent correlations between
vertex pairs. Some vertices in the graph are often linked with related vertex sets.
This method ensures that embeddings of related nodes are closer together than
unrelated ones by minimizing a margin-based ranking loss.
Graph Kernels
Graph kernel methods represent the entire graph structure as a vector containing
counts of basic substructures extracted from the graph. Subcategories of graph kernel
techniques include: graphlets, subtree patterns, and random walk-based methods.
This approach is designed to embed whole graphs, focusing only on global graph
features. The input is typically homogeneous graphs or graphs with auxiliary informa-
tion.
Generative Models
A generative model is defined by specifying a joint distribution over input features
and class labels, parameterized by a set of variables. There are two subcategories of
generative model-based methods: embedding graphs into latent space and incorporat-
ing semantics for embedding. Generative models can be applied to both node and edge
embeddings. They are commonly used to embed semantic information, with inputs
often being heterogeneous graphs or graphs with auxiliary attributes.
In this group, vertices are embedded into a latent semantic space where the
distance between nodes captures the graph structure.
In this method, each vertex is associated with graph semantics and should be
embedded closer to semantically relevant vertices. These semantic relationships
can be derived from descriptive nodes via a generative model.
Summary: Each graph embedding method has its own strengths and weaknesses,
which have been summarized by Cai, Hongyun [7] and are presented in Table 4.1. The
matrix factorization group learns embeddings by analyzing pairwise global similarities.
The deep learning group, in contrast, achieves promising results and is suitable for
graph embedding because it can learn complex representations from complex graph
structures.
Table 4.1: Comparison of Advantages and Disadvantages of Graph Embedding Techniques

Matrix Factorization (Graph Laplacian Eigenmaps; Node Proximity Matrix Factorization)
    Advantages: considers the global neighborhood structure.
    Disadvantages: requires large computation time and space.

Edge Reconstruction (maximize edge reconstruction probability; minimize distance-based loss; minimize margin-based ranking loss)
    Advantages: relatively efficient training.
    Disadvantages: only optimizes local information, e.g., first-order neighbors or ranked node pairs.

Graph Kernels (based on graphlets; based on subtree patterns; based on random walks)
    Advantages: efficient, considers only the desired primitive structures.
    Disadvantages: substructures are not independent; embedding dimensionality increases exponentially.

Generative Models (embedding into latent space; incorporating semantics)
    Advantages: embedding is interpretable; leverages multiple information sources.
    Disadvantages: difficult to tune the distribution selection; requires a large amount of naturally labeled data.

Deep Learning (with random walks; without random walks)
    Advantages: efficient and fast; no need for manual feature extraction.
    Disadvantages: with random walks, considers only local path content and the optimal sampling strategy is difficult to find; without random walks, high computational cost.
Graph embedding methods based on random walk in deep learning have lower com-
putational cost compared to those using full deep learning models. Traditional methods
often treat graphs as grids; however, this does not reflect the true nature of graphs. In
the edge reconstruction group, the objective function is optimized based on observed
edges or by ranking triplets. While this approach is more efficient, the resulting em-
bedding vectors do not account for the global structure of the graph. The graph kernel
methods transform graphs into vectors, enabling graph-level tasks such as graph clas-
sification. Therefore, they are only effective when the desired primitive structures in
a graph can be enumerated. The generative model group naturally integrates infor-
mation from multiple sources into a unified model. Embedding a graph into a latent
semantic space produces interpretable embedding vectors using semantics. However,
the assumption that observations follow a specific distribution can be difficult to justify. Moreover, generative approaches require a large amount of training data to
estimate a model that fits the data well. Hence, they may perform poorly on small
graphs or when only a few graphs are available.
Among these methods, deep learning-based graph embedding allows learning com-
plex representations and has shown the most promising results. Graph attention net-
works (GATs), which are based on attention mechanisms, aggregate the information
of an entity using attention weights from its neighbors relative to the central entity.
We believe this research direction is aligned with studies on the relationship between
attention and memory [34], where the distribution of attention determines the weight
or importance of one entity relative to another. Likewise, the embedding vector rep-
resenting an entity is influenced by the attention or importance of its neighboring
embeddings. Therefore, this is the approach we selected among the existing graph
embedding methods.
4.2.1 Attention Mechanism
The attention mechanism was introduced by Vaswani, Ashish [46]. It is an effective method to indicate the importance of a word with respect to other words in a sentence, and it has been shown to generalize any convolution operation, as reported by Cordonnier, Jean-Baptiste [9]. To understand how multi-head attention is applied to graphs, in this section we detail the mechanism so that we can better understand how it is used in link prediction tasks within knowledge graphs.
The attention mechanism transforms input vectors of Din dimensions into output
vectors of Dattention dimensions to represent the importance of each of the Nx elements
x with respect to all Ny elements y. Given X ∈ RNx ×Din and Y ∈ RNy ×Din as input
embedding matrices, and H ∈ RNx ×Dattention as the output embedding matrix, the
attention mechanism introduced by Vaswani et al. [46] is defined as follows:
H = Attention(Q, K, V) = softmax( QK^T / √d_k ) V        (4.1)

where Q = XW_Q, K = YW_K, V = YW_V.
The weight matrices W_Q ∈ R^{D_in×D_k}, W_K ∈ R^{D_in×D_k}, and W_V ∈ R^{D_in×D_attention} are used to parameterize the transformation from input embedding vectors of dimension D_in into output embedding vectors of dimension D_k or D_attention. The term QK^T represents the dot product between each vector x and all vectors y. Dividing by √d_k normalizes the result with respect to the vector dimension d_k. The result is then passed through the softmax function to enable comparison of attention scores across different pairs. We can interpret softmax(QK^T / √d_k) as the attention coefficients, indicating the importance of each y with respect to each x. Finally, this is multiplied with the value matrix V to produce the final output embedding of dimension D_attention.
If X = Y, we are computing the importance of each element with respect to other
elements in the same input matrix, which is referred to as the self-attention mechanism.
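The following NumPy sketch implements Equation 4.1 directly; it is a minimal illustration under self-chosen toy dimensions, not the implementation used in our experiments:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Y, W_Q, W_K, W_V):
    # Scaled dot-product attention of Equation 4.1.
    # X: (N_x, D_in), Y: (N_y, D_in); returns H of shape (N_x, D_attention).
    Q, K, V = X @ W_Q, Y @ W_K, Y @ W_V
    d_k = K.shape[-1]
    coeffs = softmax(Q @ K.T / np.sqrt(d_k))  # (N_x, N_y) attention coefficients
    return coeffs @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # N_x = 4 elements of dimension D_in = 8
W_Q = rng.normal(size=(8, 16))
W_K = rng.normal(size=(8, 16))
W_V = rng.normal(size=(8, 32))
H = attention(X, X, W_Q, W_K, W_V)            # X = Y: self-attention
print(H.shape)                                # (4, 32)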
4.2.2 Multi-Head Attention
The multi-head attention mechanism is a way of combining multiple attention layers to
stabilize the learning process. Similar to the standard attention mechanism above, the
multi-head attention mechanism transforms the initial Nx embedding vectors of dimen-
sion Din into output embedding vectors of dimension Dmulti-head , aggregating informa-
tion from various other nodes to provide greater stability during training. Multi-head
attention stacks Nhead attention output matrices H, and then applies a weight ma-
trix to transform the original embedding matrix X ∈ RNx ×Din into a new embedding
matrix X′ ∈ RNx ×Dmulti-head using the following formula:
X′ = ( ∥_{h=1}^{N_head} H^{(h)} ) W_O
   = ( ∥_{h=1}^{N_head} Attention(X W_Q^{(h)}, Y W_K^{(h)}, Y W_V^{(h)}) ) W_O        (4.2)

Here, the weight matrices W_Q^{(h)}, W_K^{(h)} ∈ R^{D_in×D_k} and W_V^{(h)} ∈ R^{D_in×D_attention} correspond to each individual attention head h ∈ [N_head]. The output projection matrix W_O ∈ R^{N_head·D_attention×D_multi-head} parameterizes the transformation of the concatenated output heads into the final output embedding matrix.
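Building on the previous sketch (and reusing its numpy import and the attention and softmax helpers), Equation 4.2 can be illustrated by concatenating several attention heads and projecting the result with W_O; the dimensions below are again arbitrary toy values:

def multi_head_attention(X, Y, heads, W_O):
    # heads: list of (W_Q, W_K, W_V) tuples, one per attention head.
    # Concatenates the per-head outputs and projects them with W_O (Equation 4.2).
    outputs = [attention(X, Y, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
    return np.concatenate(outputs, axis=-1) @ W_O

rng = np.random.default_rng(1)
D_in, D_k, D_att, D_multi, N_head = 8, 16, 16, 24, 4
X = rng.normal(size=(5, D_in))
heads = [(rng.normal(size=(D_in, D_k)),
          rng.normal(size=(D_in, D_k)),
          rng.normal(size=(D_in, D_att))) for _ in range(N_head)]
W_O = rng.normal(size=(N_head * D_att, D_multi))
X_new = multi_head_attention(X, X, heads, W_O)
print(X_new.shape)   # (5, 24): N_x rows of dimension D_multi-head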
At this point, we have presented the attention mechanism for computing attention
scores and aggregating embedding information from neighboring vectors. In the next
section, we will describe how this attention mechanism is applied to knowledge graphs.
Figure 4.7: Knowledge graph and normalized attention coefficients of the neighbors of an entity e_i
4.3 Graph Attention Network
In the GAT model, the entity set is represented by an input embedding matrix E = {e⃗_1, e⃗_2, ..., e⃗_{N_e}}. The objective of the model is to transform this into a new output embedding matrix E′′ = {e⃗′′_1, e⃗′′_2, ..., e⃗′′_{N_e}} capable of aggregating embedding information from neighboring entities. Here, E ∈ R^{N_e×D_in} and E′′ ∈ R^{N_e×D′′} denote the input and output embedding matrices for the entity set, respectively. N_e is the number of entities, and D_in, D′′ are the dimensions of the input and output embeddings.
Similar to the multi-head attention mechanism introduced in 4.2.1, the application
of this mechanism to a knowledge graph follows the same logic as the self-attention
mechanism, in which each node attends to all other nodes in the graph. However,
computing attention scores between every pair of nodes in a graph is not meaningful
if no relationship exists between them, and it would incur significant computational
overhead. Therefore, the model applies a technique known as masked attention, in
which all attention scores corresponding to unrelated nodes in the graph are ignored.
These relevant connections are precisely defined as the first-order proximity (Definition
5) of a node in the graph. Thus, in this context, we let X = Y = E (as in 4.2.1), and
the attention coefficient in the masked attention mechanism represents the importance
of a neighboring node j ∈ Ni to the central node i, where Ni is the set of all first-order
neighbors of node i (including i itself).
The application of the multi-head attention mechanism in 4.2 to graphs is described as follows:

e_ij = f_mask attention( W e⃗_i, W e⃗_j )        (4.3)

where e_ij denotes the multi-head attention coefficient of an edge (e_i, e_j) with respect to the central entity e_i in the knowledge graph Gknow, W is a weight matrix that parameterizes the linear transformation, and f_mask attention is the function applying the attention mechanism.
In the GAT model, each entity embedding vector e⃗_i undergoes two transformation steps, each applying the multi-head attention mechanism, as follows:

e⃗_i  --f^(1)_mask attention-->  e⃗′_i  --f^(2)_mask attention-->  e⃗′′_i        (4.4)
In the first multi-head attention step (f^(1)_mask attention), the model aggregates information from neighboring entities and concatenates the attention heads to produce the vector e⃗′_i, where e⃗′_i ∈ R^{1×D′}. In the second step (f^(2)_mask attention), concatenating the attention heads is no longer sensible, since this is the final layer; therefore, the output is computed as an average of the heads instead of a concatenation. The vector e⃗′_i is then treated as the input embedding to be transformed into the final output embedding vector e⃗′′_i, with e⃗′′_i ∈ R^{1×D′′}.
First, similar to the attention mechanism in 4.1, each embedding vector is multiplied by a weight matrix W_1 ∈ R^{D_k×D_in} to parameterize the linear transformation from D_in input dimensions to D_k higher-level feature dimensions:

h⃗_i = W_1 e⃗_i        (4.5)

where e⃗_i ∈ R^{D_in×1} and h⃗_i ∈ R^{D_k×1}.
Next, we concatenate each pair of linearly transformed entity embedding vectors to compute the attention coefficients. The attention coefficient e_ij reflects the importance of the edge feature (e_i, e_j) with respect to the central entity e_i, or in other words, the importance of a neighboring entity e_j that is connected to e_i. The LeakyReLU function is applied to introduce non-linearity. Each attention coefficient e_ij is computed using the following equation:

e_ij = LeakyReLU( W_2^T [h⃗_i || h⃗_j] )        (4.6)
To enable meaningful comparison between the attention coefficients of neighboring
entities, a softmax function is applied to normalize the coefficients over all neighbors
ej that are connected to the central entity ei : αij = softmaxj (eij ). Combining all of
this, we obtain the final normalized attention coefficient of each neighbor with respect
to the central entity as follows:
α_ij = exp( LeakyReLU( W_2^T [h⃗_i || h⃗_j] ) ) / Σ_{k∈N_i} exp( LeakyReLU( W_2^T [h⃗_i || h⃗_k] ) )        (4.7)
At this stage, the GAT model operates similarly to GCN [23], where the embed-
ding vectors from neighboring nodes are aggregated and scaled by their corresponding
normalized attention coefficients:
e⃗′_i = σ( Σ_{j∈N_i} α_ij h⃗_j )        (4.8)
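A single GAT attention head (Equations 4.5–4.8) can be sketched in a few lines of NumPy; the adjacency lists, dimensions, and the choice of tanh as the non-linearity σ are illustrative assumptions:

import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def gat_head(E, neighbors, W1, w2):
    # One GAT attention head.
    # E: (N_e, D_in) entity embeddings; neighbors[i]: list of j in N_i (including i);
    # W1: (D_k, D_in) linear transform (Eq. 4.5); w2: (2*D_k,) attention vector (Eq. 4.6).
    H = E @ W1.T                                   # Eq. 4.5: h_i = W1 e_i
    E_out = np.zeros_like(H)
    for i, N_i in enumerate(neighbors):
        e = np.array([leaky_relu(w2 @ np.concatenate([H[i], H[j]])) for j in N_i])
        alpha = np.exp(e - e.max()); alpha /= alpha.sum()              # Eq. 4.7: softmax
        E_out[i] = np.tanh(sum(a * H[j] for a, j in zip(alpha, N_i)))  # Eq. 4.8
    return E_out

rng = np.random.default_rng(2)
E = rng.normal(size=(4, 6))                        # 4 entities with D_in = 6
neighbors = [[0, 1, 2], [1, 0], [2, 0, 3], [3, 2]]
W1 = rng.normal(size=(8, 6))
w2 = rng.normal(size=(16,))
print(gat_head(E, neighbors, W1, w2).shape)        # (4, 8)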
4.4 KBGAT Model
In this section, we present our complete embedding model, which is based on the KBGAT model proposed by Nathani, Deepak [32].
First, the embedding vectors of each entity are initialized using the TransE model
to capture the spatial characteristics among nodes and obtain the initial embeddings.
These embeddings are then further trained using an encoder model to capture neigh-
borhood features, resulting in updated embeddings. Finally, these embeddings are
passed through a prediction layer using the ConvKB model. All equations presented
here are based on those in the work of Nathani, Deepak[32].
Initially, the entity and relation embedding vectors are randomly initialized using
a normal distribution with dimensionality Din , and then normalized according to the
size of the entity and relation embedding sets.
[Figure 4.9: Illustration of valid and invalid triples in the TransE embedding space, showing the vectors head_valid, head_invalid, relation, tail_valid, and tail_invalid.]
Next, we perform sampling from the training dataset to obtain a batch of valid
triples (Sbatch ). For each such triple, we sample an invalid triple by replacing either
the head or the tail entity with a random entity from the entity set, yielding a batch
′
of invalid triples (Sbatch ). We then pair each valid triple with an invalid one to form
the training batch (Tbatch ). Finally, we update the embedding vectors to satisfy the
condition in 4.12.
Algorithm 5 TransE Embedding Learning Algorithm [5]
Input: Training set S = {(h, r, t)}, entity set E, relation set R, margin γ, embedding dimension Din
1: r⃗ ← uniform(−6/√Din, 6/√Din) for each relation r ∈ R        // initialization
2: r⃗ ← r⃗ / ∥r⃗∥ for each r ∈ R
3: e⃗ ← uniform(−6/√Din, 6/√Din) for each entity e ∈ E
4: loop
5:     e⃗ ← e⃗ / ∥e⃗∥ for each e ∈ E
6:     Sbatch ← sample(S, b)        // sample a minibatch of size b
7:     Tbatch ← ∅
8:     for (h, r, t) ∈ Sbatch do
9:         (h′, r, t′) ← sample(S′_(h,r,t))        // sample from the invalid triple set
10:        Tbatch ← Tbatch ∪ {((h, r, t), (h′, r, t′))}
11:    Update embeddings w.r.t. Σ_{((h,r,t),(h′,r,t′)) ∈ Tbatch} ∇ [γ + d(h⃗ + r⃗, t⃗) − d(h⃗′ + r⃗, t⃗′)]_+
Output: A set of embedding vectors with dimension Din representing entities and relations
S′_(h,r,t) = {(h′, r, t) | h′ ∈ E} ∪ {(h, r, t′) | t′ ∈ E}        (4.13)

To achieve the goal of learning embedding vectors such that h⃗ + r⃗ ≈ t⃗, the model aims for the tail embedding t⃗ of a valid triple to lie close to h⃗ + r⃗, while in invalid triples the corrupted embedding h⃗′ + r⃗ (or t⃗′) should lie far from t⃗ (or h⃗ + r⃗), according to the following margin-based ranking loss function:
L = Σ_{(h,r,t)∈S} Σ_{(h′,r,t′)∈S′_(h,r,t)} [d − d′ + γ]_+        (4.14)

Here, γ > 0 is the margin, and h′ and t′ are entities sampled as defined in Equation 4.13. ∆ and ∆′ represent the difference vectors for the embeddings in valid and invalid triples, respectively, with d = ∥∆∥_1 = ∥h⃗ + r⃗ − t⃗∥_1 and d′ = ∥∆′∥_1 = ∥h⃗′ + r⃗ − t⃗′∥_1, where ∥·∥_1 denotes the L1 norm.
As illustrated in Figure 4.9, if d′ > d (i.e., d′ − d > 0), then h⃗ + r⃗ is closer to t⃗ than to t⃗′. Since we want the embedding vectors to satisfy the condition in Equation 4.12, h⃗ + r⃗ should be as close as possible to t⃗; conversely, the closer h⃗ + r⃗ is to t⃗′, the more incorrect the embedding becomes.
Therefore, during training, we aim for d′ to become as large as possible relative to d. Once d′ − d ≥ γ, there is no need to update the embedding weights further. Hence, the loss function 4.14 keeps only the positive part of the term [d − d′ + γ], because a non-positive value already means that the condition in Equation 4.12 is satisfied during training.
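The core update of Algorithm 5 and the loss of Equation 4.14 can be sketched in a few lines of PyTorch; this is a minimal illustration (toy sizes, tails corrupted at random), not the full training pipeline used in our experiments:

import torch

def transe_loss(ent, rel, pos, neg, gamma=1.0):
    # Margin-based ranking loss of Equation 4.14.
    # ent: (N_e, D) entity embeddings, rel: (N_r, D) relation embeddings,
    # pos, neg: (B, 3) index tensors of (head, relation, tail) triples.
    def dist(batch):
        h, r, t = ent[batch[:, 0]], rel[batch[:, 1]], ent[batch[:, 2]]
        return (h + r - t).abs().sum(dim=1)             # L1 distance d
    return torch.clamp(dist(pos) - dist(neg) + gamma, min=0).mean()

torch.manual_seed(0)
ent = torch.nn.Parameter(torch.randn(5, 4))             # 5 entities, Din = 4
rel = torch.nn.Parameter(torch.randn(2, 4))             # 2 relations
opt = torch.optim.SGD([ent, rel], lr=0.01)

pos = torch.tensor([[0, 0, 1], [2, 1, 3]])              # valid triples (h, r, t)
neg = pos.clone()
neg[:, 2] = torch.randint(0, 5, (2,))                   # corrupt the tail entities
for _ in range(100):
    opt.zero_grad()
    loss = transe_loss(ent, rel, pos, neg)
    loss.backward()
    opt.step()
    with torch.no_grad():                               # re-normalize entity embeddings
        ent.data = torch.nn.functional.normalize(ent.data, dim=1)
print(float(loss))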
4.4.2 Encoder Model
After obtaining the embedding vectors that capture spatial features of the knowledge
graph, these embeddings are passed through the next embedding layer to further ag-
gregate the neighborhood information of each entity.
[Figure: the encoder applies multiple attention heads (Attention Head 1, Attention Head 2, ...) to the input triples (Triple 1, ..., Triple N), e.g., ⟨Tom Cruise, born in, New York⟩.]
The encoder transforms the entity embedding matrix E = {e⃗_1, e⃗_2, ..., e⃗_{N_e}} into E′′ = {e⃗′′_1, e⃗′′_2, ..., e⃗′′_{N_e}}, with E ∈ R^{N_e×D_in} and E′′ ∈ R^{N_e×D′′}. Simultaneously, it transforms the relation embedding matrix R = {r⃗_1, r⃗_2, ..., r⃗_{N_r}} into R′′ = {r⃗′′_1, r⃗′′_2, ..., r⃗′′_{N_r}}, with R ∈ R^{N_r×P_in} and R′′ ∈ R^{N_r×P′′}.
Similar to the GAT model described in Section 4.3, the model transforms the entity
embedding vectors from Din dimensions to D′′ dimensions by aggregating neighborhood
information through attention coefficients. Pin and P ′′ are the input and output di-
mensions of the relation embedding vectors, respectively. Ne and Nr are the sizes of
the entity and relation sets in Gknow , respectively.
The KBGAT model concatenates the entity and relation embedding vectors accord-
ing to the following structure:
t⃗_ijk = W_1 [e⃗_i || e⃗_j || r⃗_k]        (4.15)

Here, t⃗_ijk is the embedding vector representing the triple t^k_ij = (e_i, r_k, e_j), where e_j and r_k are the neighboring entity and the relation connecting the source node e_i to the node e_j. W_1 ∈ R^{D_k×(2D_in+P_in)} is a weight matrix that performs a linear transformation of the concatenated input vectors into a new vector of dimensionality D_k. These weight matrices are either randomly initialized using a normal distribution or pre-trained using the TransE model [5].
Similar to Equation 4.7 in the GAT model, we need to compute the attention
coefficient for each edge with respect to a given node. Then, the softmax function is
applied to normalize these coefficients as follows:
α_ijk = softmax_jk( LeakyReLU(W_2 t⃗_ijk) )
      = exp( LeakyReLU(W_2 t⃗_ijk) ) / Σ_{n∈N_i} Σ_{r∈R_in} exp( LeakyReLU(W_2 t⃗_inr) )        (4.16)

where N_i denotes the set of neighbors of the central node e_i within n_hop hops, and R_in represents the set of all relations that exist along the paths connecting the source entity e_i to a neighboring entity e_n ∈ N_i. Similar to Equation 4.8, the triple embedding vectors t⃗_ijk are scaled by their corresponding normalized attention coefficients:

e⃗′_i = σ( Σ_{j∈N_i} Σ_{k∈R_ij} α_ijk t⃗_ijk )        (4.17)
R′ = R W_R,   where W_R ∈ R^{P×P′}        (4.19)
At this stage, we have obtained two matrices: H′ ∈ R^{N_e×D′} and R′ ∈ R^{N_r×P′}, which are the updated entity and relation embedding matrices, respectively, with new dimensions. The model proceeds through the final attention layer, taking as input the newly updated entity and relation embeddings as shown in Equation 4.10. However, when multi-head attention is applied at this final prediction layer, concatenating the attention heads is no longer sensible. Therefore, instead of concatenating, the model averages the outputs and then applies a final non-linear activation function:
e⃗′′_i = σ( (1/N_head) Σ_{h=1}^{N_head} Σ_{j∈N_i} Σ_{k∈R_ij} α′^(h)_ijk t⃗′^(h)_ijk )        (4.20)

where α′^(h)_ijk and t⃗′^(h)_ijk denote the normalized attention coefficients and the triple embedding vectors for (e_i, r_k, e_j) in attention head h, respectively.
Up to this point, the KBGAT model functions similarly to the GAT model in Section 4.3, but it additionally incorporates relation embedding information and neighbor nodes up to n_hop hops. This results in the final entity embedding matrix E′′ ∈ R^{N_e×D′′} and the final relation embedding matrix R′′ ∈ R^{N_r×P′′}.
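To make Equations 4.15–4.17 concrete, the following NumPy sketch implements a single one-hop KBGAT attention head over a list of triples; it is an illustrative simplification (one head, tanh as σ, toy dimensions), not the full encoder:

import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def kbgat_head(E, R, triples, W1, w2):
    # One KBGAT attention head.
    # E: (N_e, D_in) entity embeddings, R: (N_r, P_in) relation embeddings,
    # triples: list of (i, k, j) with source entity i, relation k, neighbor j.
    N_e, D_k = E.shape[0], W1.shape[0]
    per_entity = {i: [] for i in range(N_e)}
    for (i, k, j) in triples:
        t_ijk = W1 @ np.concatenate([E[i], E[j], R[k]])        # Eq. 4.15
        per_entity[i].append((leaky_relu(w2 @ t_ijk), t_ijk))  # unnormalized coefficient
    E_out = np.zeros((N_e, D_k))
    for i, items in per_entity.items():
        if not items:
            continue                                           # entity with no outgoing triple
        e = np.array([s for s, _ in items])
        alpha = np.exp(e - e.max()); alpha /= alpha.sum()      # Eq. 4.16: softmax over i's triples
        E_out[i] = np.tanh(sum(a * t for (a, (_, t)) in zip(alpha, items)))  # Eq. 4.17
    return E_out

rng = np.random.default_rng(3)
E = rng.normal(size=(4, 5))                  # 4 entities, D_in = 5
R = rng.normal(size=(2, 3))                  # 2 relations, P_in = 3
triples = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (2, 1, 0)]
W1 = rng.normal(size=(6, 2 * 5 + 3))         # D_k = 6
w2 = rng.normal(size=(6,))
print(kbgat_head(E, R, triples, W1, w2).shape)   # (4, 6)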
However, after the embedding learning process, the final entity embedding matrix $E''$ may lose the initial embedding information due to the vanishing gradient problem. To address this, the model employs residual learning: it projects the initial embedding matrix $E$ through a weight matrix $W_E \in \mathbb{R}^{D_{in} \times D''}$ and adds the result directly to the final embedding, thereby preserving the initial embedding information during training:
$$H = W_E\,E + E'' \qquad (4.21)$$
Finally, the training datasets are sampled to generate valid and invalid triples, similar to the TransE model described above, in order to learn the embedding vectors. However, the distance between embedding vectors is computed using the L1 norm as follows:

$$d_{t_{ij}} = \left\lVert \vec{h}_i + \vec{g}_k - \vec{h}_j \right\rVert_1$$
Similarly, we train the model using a margin-based loss function:

$$\mathcal{L}(\Omega) = \sum_{t_{ij} \in S}\;\sum_{t'_{ij} \in S'} \max\{\,d_{t'_{ij}} - d_{t_{ij}} + \gamma,\; 0\,\} \qquad (4.22)$$

where $\gamma > 0$ is the margin parameter, $S$ is the set of valid triples, and $S'$ is the set of invalid triples, obtained by corrupting either the head or the tail entity of a valid triple, as in the TransE model.
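The sketch below illustrates the L1 distance and a margin-based hinge loss in the TransE convention, where a valid triple is pushed to be at least $\gamma$ closer than a corrupted one. The exhaustive pairing of valid and invalid triples and the function names are simplifications for illustration, not the training code used in our experiments.

```python
import numpy as np

def l1_distance(h_i, g_k, h_j):
    """d_t = || h_i + g_k - h_j ||_1 for a triple (e_i, r_k, e_j)."""
    return np.abs(h_i + g_k - h_j).sum()

def margin_loss(valid_triples, invalid_triples, gamma=1.0):
    """Margin-based hinge loss (TransE convention): a valid triple should
    have a distance at least gamma smaller than a corrupted one."""
    loss = 0.0
    for h_i, g_k, h_j in valid_triples:
        d_valid = l1_distance(h_i, g_k, h_j)
        for h_c, g_c, t_c in invalid_triples:
            d_invalid = l1_distance(h_c, g_c, t_c)
            loss += max(d_valid - d_invalid + gamma, 0.0)
    return loss
```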
Figure 4.11: Illustration of the decoder layers of the ConvKB model with 3 filters
After mapping entities and relations into a low-dimensional space, the model employs ConvKB [33] to analyze the global features of a triple $t_{ijk}$ across each dimension, thereby generalizing transformation patterns through convolutional layers. The scoring function with the learned feature maps is defined as follows:

$$f(t_{ijk}) = \left( \big\Vert_{m=1}^{\Omega}\, \mathrm{ReLU}\left([\vec{e}_i, \vec{r}_k, \vec{e}_j] * \omega^m\right) \right) \cdot W \qquad (4.24)$$
where $\omega^m$ denotes the $m$-th convolutional filter, $\Omega$ is a hyperparameter specifying the number of convolutional filters, $*$ denotes the convolution operation, and $W \in \mathbb{R}^{\Omega k \times 1}$ is the linear transformation matrix used to compute the final score of the triple.
The model is trained using a soft-margin loss function as follows:
$$\mathcal{L} = \sum_{t^k_{ij} \in \{S \cup S'\}} \log\left(1 + \exp\left(l_{t^k_{ij}} \cdot f\left(t^k_{ij}\right)\right)\right) + \frac{\lambda}{2}\,\lVert W \rVert_2^2 \qquad (4.25)$$

where

$$l_{t^k_{ij}} = \begin{cases} \;\;\,1 & \text{for } t^k_{ij} \in S \\ -1 & \text{for } t^k_{ij} \in S' \end{cases}$$
The final output of the ConvKB model is the ranking score corresponding to each
prediction.
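The following sketch illustrates a ConvKB-style score in the spirit of Equation 4.24, with $\Omega$ filters of shape 1×3 sliding over the embedding dimensions, together with the soft-margin loss of Equation 4.25. The array shapes and function names are illustrative assumptions, not the original ConvKB implementation.

```python
import numpy as np

def convkb_score(e_i, r_k, e_j, filters, W):
    """ConvKB-style score for a triple.

    e_i, r_k, e_j: k-dimensional embeddings.
    filters: (Omega, 3) array of 1x3 convolutional filters.
    W: (Omega * k,) weight vector mapping concatenated feature maps to a score.
    """
    A = np.stack([e_i, r_k, e_j], axis=1)      # (k, 3) input matrix
    feature_maps = []
    for w in filters:                          # each filter slides over the k rows
        fm = np.maximum(A @ w, 0.0)            # ReLU([e_i, r_k, e_j] * w), shape (k,)
        feature_maps.append(fm)
    return np.concatenate(feature_maps) @ W    # scalar score

def soft_margin_loss(scores, labels, W, lam=0.001):
    """Eq. 4.25: soft-margin loss summed over triples plus L2 regularization
    on W; labels are +1 for valid triples and -1 for invalid ones."""
    return np.sum(np.log1p(np.exp(labels * scores))) + lam / 2 * np.sum(W ** 2)

# Toy usage: k = 4 embedding dimensions, Omega = 3 filters
rng = np.random.default_rng(2)
e_i, r_k, e_j = (rng.normal(size=4) for _ in range(3))
filters = rng.normal(size=(3, 3))
W = rng.normal(size=3 * 4)
print(convkb_score(e_i, r_k, e_j, filters, W))
```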
Chapter 5. EXPERIMENTS
[Chart: number of entities and relations in the FB15k, FB15k-237, WN18, WN18RR, and YAGO3-10 datasets; see Table 5.1.]
In this section, we describe the datasets used for our empirical evaluation, along
with a comparison against notable existing methods as reported in Table 4.1. Addi-
tionally, we evaluate our two proposed approaches for injecting new knowledge into
the knowledge graph. Specifically, we treat the test set as a batch of new knowledge
to be added, and use the validation set to re-evaluate the effectiveness of our method.
Detailed results are presented in Table 5.2 and Table 5.3.
Dataset      Entities   Relations   Edges (train)   Edges (valid)   Edges (test)
FB15k        14,951     1,345       483,142         50,000          59,071
FB15k-237    14,541     237         272,115         17,535          20,466
WN18         40,943     18          141,442         5,000           5,000
WN18RR       40,559     11          86,835          3,034           3,134
YAGO3-10     123,182    37          1,079,040       5,000           5,000

Table 5.1: Statistics of the datasets used in our experiments
5.1 Training Datasets
[Chart: total number of triples in each dataset (FB15k: 592,213; FB15k-237: 310,116; WN18: 151,442; WN18RR: 93,003; YAGO3-10: 1,089,040).]
In our experiments, we evaluate our approach on four widely used benchmark datasets:
FB15k, FB15k-237 ([43]), WN18, and WN18RR ([12]). Each dataset is divided into
three subsets: training, validation, and test sets. Detailed statistics for these datasets
are presented in Table 5.1.
Each dataset consists of a collection of triples in the form ⟨head, relation, tail⟩.
FB15k and WN18 are derived from the larger knowledge bases FreeBase and WordNet,
respectively. However, they contain a large number of inverse relations, which allow
most triples to be easily inferred. To address this issue and to better reflect real-world
link prediction scenarios, FB15k-237 and WN18RR were constructed by removing such
inverse relations.
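Each split is typically distributed as a tab-separated text file of ⟨head, relation, tail⟩ strings; a minimal loader might look as follows (the file path in the usage comment is a placeholder).

```python
def load_triples(path):
    """Read a tab-separated file of <head, relation, tail> triples."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            head, relation, tail = line.rstrip("\n").split("\t")
            triples.append((head, relation, tail))
    return triples

# Example (path is a placeholder):
# train = load_triples("FB15k-237/train.txt")
```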
5.1.2 FB15k-237 Dataset
This dataset is a subset of FB15k, constructed by Toutanova and Chen [43], motivated
by the observation that FB15k suffers from test leakage, where models are exposed to
test facts during training. This issue arises due to the presence of duplicate or inverse
relations in FB15k. FB15k-237 was created to provide a more challenging benchmark.
The authors selected facts related to the 401 most frequent relations and eliminated all
redundant or inverse relations. Additionally, they ensured that no entities connected
in the training set were directly linked in the test and validation sets.
5.1.3 WN18 Dataset

[Chart: frequency of each relation type in the WN18 dataset (e.g., hypernym, hyponym, member meronym, derivationally related form).]
This dataset was introduced by the authors of TransE [5], and is extracted from
WordNet2 , a lexical knowledge graph ontology designed to provide a dictionary/the-
saurus to support NLP tasks and automated text analysis. In WordNet, entities corre-
spond to synsets (i.e., word senses), and relations represent lexical connections among
them (e.g., “hypernym”).
To construct WN18, the authors used WordNet as a starting point and iteratively
filtered out entities and relations that were infrequently mentioned.
2 https://wordnet.princeton.edu/
5.1.4 WN18RR Dataset
This dataset is a subset of WN18, constructed by Dettmers et al. [11], who also addressed the issue of test leakage in WN18. To tackle this issue, they created the WN18RR dataset, which is significantly more challenging, by applying a methodology similar to that used for FB15k-237 [43].
5.2 Evaluation Metrics

Hits@K (H@K) measures the proportion of correct predictions whose rank is less than or equal to the threshold K:
$$H@K = \frac{\left|\{\,q \in Q : \mathrm{rank}(q) \le K\,\}\right|}{|Q|}$$
Mean Rank (MR) is the average rank of the correct entity over all predictions; a lower value indicates better model performance:
$$MR = \frac{1}{|Q|}\sum_{q \in Q} \mathrm{rank}(q)$$
Here, | Q | denotes the total number of queries, which equals the size of the test or
validation set. During evaluation, we perform both head and tail entity predictions for
each triple. For example, we predict both ⟨?, relation, tail⟩ and ⟨head, relation, ?⟩.
The variable q denotes a query, and rank(q) indicates the rank position of the correct
entity. The final MR score is the average rank over all head and tail predictions.
Clearly, this metric ranges over $[1, N_e]$, since a node can connect to at most $n-1$ other nodes plus a self-loop. However, it is highly sensitive to outliers, as a few hard queries can push the correct entity to an extremely large rank and dominate the average. To address this issue, our method, like other recent works, also adopts the Mean Reciprocal Rank (MRR) metric.
The Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks of the correct predictions. Higher values indicate better model performance. Because each rank enters as its reciprocal, this metric mitigates the outlier sensitivity of the Mean Rank (MR) metric:
$$MRR = \frac{1}{|Q|}\sum_{q \in Q} \frac{1}{\mathrm{rank}(q)}$$
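Given the list of ranks collected over all head and tail prediction queries, the three metrics can be computed directly; the following is a minimal sketch.

```python
def hits_at_k(ranks, k):
    """H@K: fraction of queries whose correct entity is ranked within the top K."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_rank(ranks):
    """MR: average rank of the correct entity (lower is better)."""
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank (higher is better, less sensitive to outliers)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Example: ranks collected over all head and tail predictions
ranks = [1, 3, 2, 50, 1]
print(hits_at_k(ranks, 10), mean_rank(ranks), mean_reciprocal_rank(ranks))
```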
We implemented the AnyBURL algorithm in Python version 3.6, using only built-in Python functions without any third-party libraries. The experiments were conducted on the four widely used datasets FB15k, FB15k-237, WN18, and WN18RR; detailed information about these datasets is provided in Section 5.1.
As described in Algorithm 2, the AnyBURL algorithm learns rules within a user-configurable time interval. Here, we set the training time to 1,000 seconds (approximately 17 minutes), the saturation (SAT) to 0.85, the confidence threshold Q to 0.05, and the sample size S to 1/10 of the training set. With this setup, our Python version of the model produced results comparable to the Java version developed by Meilicke et al. [27], which was configured similarly but trained for only
100 seconds. The difference in training time is primarily due to performance differences
between Python and Java. We chose Python because it is the primary language used
in many recent artificial intelligence models and provides convenience for performance
comparison and evaluation with other deep learning methods, most of which are also
implemented in Python.
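For reference, the settings described above could be grouped into a simple configuration object; the field names below are our own illustrative choices, not parameters of the original AnyBURL implementation.

```python
from dataclasses import dataclass

@dataclass
class AnyBURLConfig:
    """Hypothetical grouping of the training settings used in our experiments."""
    training_time_s: int = 1000       # rule-learning time budget (seconds)
    saturation: float = 0.85          # saturation (SAT) parameter
    confidence_threshold: float = 0.05  # confidence threshold Q
    sample_fraction: float = 0.1      # sample size S as a fraction of the training set

config = AnyBURLConfig()
print(config)
```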
                    FB15k                          FB15k-237
           H@1     H@10    MR     MRR      H@1     H@10    MR     MRR
ComplEx    81.56   90.53   34     0.848    25.72   52.97   202    0.349
TuckER     72.89   88.88   39     0.788    25.90   53.61   162    0.352
TransE     49.36   84.73   45     0.628    21.72   49.65   209    0.31
RotatE     73.93   88.10   42     0.791    23.83   53.06   178    0.336
ConvKB     59.46   84.94   51     0.688    21.90   47.62   281    0.305
KBGAT      70.08   91.64   38     0.784    36.06   58.32   211    0.4353
AnyBURL    79.13   82.30   285    0.824    20.85   42.40   490    0.311

Table 5.2: Link prediction results on FB15k and FB15k-237
                    WN18                           WN18RR
           H@1     H@10    MR     MRR      H@1     H@10    MR     MRR
ComplEx    94.53   95.50   3623   0.349    42.55   52.12   4909   0.458
TuckER     94.64   95.80   510    0.951    42.95   51.40   6239   0.459
TransE     40.56   94.87   279    0.646    2.79    94.87   279    0.646
RotatE     94.30   96.02   274    0.949    42.60   57.35   3318   0.475
ConvKB     93.89   95.68   413    0.945    38.99   50.75   4944   0.427
KBGAT      -       -       -      -        35.12   57.01   1974   0.4301
AnyBURL    93.96   95.07   230    0.955    44.22   54.40   2533   0.497

Table 5.3: Link prediction results on WN18 and WN18RR
Table 5.2 and Table 5.3 describe our experimental results with the H@K metrics
along with the experimental results of other methods mentioned in the survey [38].
Table 5.4: Accuracy results of the two new knowledge addition strategies
Table 5.4 describes our experimental results for the two strategies of adding new knowledge into the graph. We evaluate the total number of generated rules, as well as the number of rules with confidence ≥ 50% and ≥ 80%.
Dataset      Metric              Batch edge AnyBURL   Edge AnyBURL
FB15k        Number of rules     1011                 1367
             Confidence ≥ 50%    416 (41.14%)         1185 (86.69%)
             Confidence ≥ 80%    284 (28.09%)         481 (35.18%)
FB15k-237    Number of rules     1120                 756
             Confidence ≥ 50%    244 (21.79%)         660 (87.30%)
             Confidence ≥ 80%    95 (8.48%)           162 (21.43%)
WN18         Number of rules     533                  260
             Confidence ≥ 50%    270 (38.46%)         252 (96.92%)
             Confidence ≥ 80%    240 (34.19%)         225 (86.54%)
WN18RR       Number of rules     439                  106
             Confidence ≥ 50%    110 (25.05%)         102 (96.22%)
             Confidence ≥ 80%    83 (18.91%)          85 (81.19%)

Table 5.5: Evaluation results on the number of rules of the two new knowledge addition strategies
Chapter 6. CONCLUSION
In this chapter, we have presented the results achieved by the proposed model, along with detailed analyses on the different datasets to clarify both its strengths and its remaining limitations. From this, we identify potential research directions for improving the model in the future.
Although our rule-based method achieves performance comparable to modern, state-of-the-art deep learning models, and clearly outperforms them in training time (only about 17 minutes versus several hours for the deep learning models), this does not imply that deep learning models are not worth studying. On
the contrary, through performance analysis across different datasets, we observed that
for datasets with diverse relations like FreeBase, the KBGAT model using attention
mechanisms yielded significantly better results than on datasets like WordNet, which
contain fewer relation types. This highlights the potential of leveraging deep learning
mechanisms tailored to the specific characteristics of each dataset.
This shows that the attention mechanism, by incorporating relational embedding
information, helps to better capture graph structures in datasets with a wide variety
of relations. For datasets that contain many similar and inverse triples, such as FB15k and WN18, the rule-based AnyBURL model achieved superior results, whereas the deep learning methods only achieved average performance relative to the other approaches. However, for datasets from which such similar or inverse information has been removed, namely FB15k-237 and WN18RR, the rule-based method is less effective, since it relies on previously observed paths or links. In contrast, deep learning models represent relations and entities in a vector space and learn their interactions, allowing them to perform comparatively better on FB15k-237 and WN18RR than on FB15k and WN18.
One of the main advantages of the rule-based approach is that the generated rules
are interpretable during training and require significantly less training time compared
to other methods. However, after the training phase, the rule-based method must
iterate through all learned rules to make predictions. This is an area where deep
learning models show better performance, as models like KBGAT can use the learned
weights and computational layers to transform inputs into probabilistic predictions
much faster. The drawback of deep learning approaches is their lack of interpretability
during training, as well as the high computational cost. Regarding our two proposed
algorithms for adding new knowledge to the graph, we found them to significantly
outperform deep learning methods.
The graph embedding process helps represent the features of entities, relations, or
the characteristics of the knowledge graph as lower-dimensional vectors (Section 4.1).
However, in practice, the entities and relations that constitute a piece of knowledge are independent components; thus, they should be embedded into vectors of different dimensionalities. The ratio between the entity and relation embedding dimensions is also an important issue that requires further investigation.
Additionally, in the real world, the temporal factor is a critical piece of information
that can completely alter the meaning of a piece of knowledge. Therefore, integrating
temporal information into the attention mechanism is one of the research directions we
aim to pursue to ensure the semantic accuracy of the knowledge graph.
For the rule-based AnyBURL method, the reinforcement learning branch has recently seen significant progress, and Meilicke et al. [26] have proposed a study that optimizes the AnyBURL method using reinforcement learning. We also intend to explore this direction and aim to report our findings in the near future.
REFERENCES
[1] Franz Baader and Tobias Nipkow. Term rewriting and all that. Cambridge uni-
versity press, 1999.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine trans-
lation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473,
2014.
[3] Shumeet Baluja, Rohan Seth, Dharshi Sivakumar, Yushi Jing, Jay Yagnik,
Shankar Kumar, Deepak Ravichandran, and Mohamed Aly. Video suggestion
and discovery for youtube: taking random walks through the view graph. In Pro-
ceedings of the 17th international conference on World Wide Web, pages 895–904,
2008.
[5] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Ok-
sana Yakhnenko. Translating embeddings for modeling multi-relational data. In
Advances in neural information processing systems, pages 2787–2795, 2013.
[6] Samuel R Buss. An introduction to proof theory. Handbook of proof theory, 137:1–
78, 1998.
[7] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. A comprehen-
sive survey of graph embedding: Problems, techniques, and applications. IEEE
Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.
[8] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representa-
tions with global structural information. In Proceedings of the 24th ACM interna-
tional on conference on information and knowledge management, pages 891–900,
2015.
[9] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship
between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584,
2019.
[10] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. Multi-head atten-
tion: Collaborate instead of concatenate. arXiv preprint arXiv:2006.16362, 2020.
[11] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Con-
volutional 2d knowledge graph embeddings. arXiv preprint arXiv:1707.01476,
2017.
[12] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Con-
volutional 2d knowledge graph embeddings. In Thirty-Second AAAI Conference
on Artificial Intelligence, 2018.
[13] Luis Galárraga, Christina Teflioudi, Katja Hose, and Fabian M. Suchanek. Amie:
association rule mining under incomplete evidence in ontological knowledge bases.
In WWW ’13, 2013.
[14] Luis Galárraga, Christina Teflioudi, Katja Hose, and Fabian M Suchanek. Fast rule
mining in ontological knowledge bases with amie. The VLDB Journal, 24(6):707–
730, 2015.
[15] Google. Introducing the Knowledge Graph: things, not strings, 2020 (accessed on
August 27, 2020).
[16] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and
performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
[17] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for net-
works. In Proceedings of the 22nd ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 855–864, 2016.
[18] Lingbing Guo, Zequn Sun, and Wei Hu. Learning to exploit long-term relational
dependencies in knowledge graphs. arXiv preprint arXiv:1905.04914, 2019.
[19] Wilfrid Hodges et al. A shorter model theory. Cambridge university press, 1997.
[20] John J Hopfield. Hopfield network. Scholarpedia, 2(5):1977, 2007.
[21] Sergey Ivanov and Evgeny Burnaev. Anonymous walk embeddings. arXiv preprint
arXiv:1805.11921, 2018.
[22] Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and Philip S Yu. A
survey on knowledge graphs: Representation, acquisition and applications. arXiv
preprint arXiv:2002.00388, 2020.
[23] Thomas N Kipf and Max Welling. Semi-supervised classification with graph con-
volutional networks. arXiv preprint arXiv:1609.02907, 2016.
[24] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recog-
nition with gradient-based learning. In Shape, contour and grouping in computer
vision, pages 319–345. Springer, 1999.
[25] Johann A. Makowsky. Why horn formulas matter in computer science: Initial
structures and generic examples. Journal of Computer and System Sciences, 34(2-
3):266–292, 1987.
[26] Christian Meilicke, Melisachew Wudage Chekol, Manuel Fink, and Heiner Stuck-
enschmidt. Reinforced anytime bottom up rule learning for knowledge graph com-
pletion. arXiv preprint arXiv:2004.04412, 2020.
[27] Christian Meilicke, Melisachew Wudage Chekol, Daniel Ruffinelli, and Heiner
Stuckenschmidt. Anytime Bottom-Up Rule Learning for Knowledge Graph Com-
pletion, 2019.
[28] Christian Meilicke, Melisachew Wudage Chekol, Daniel Ruffinelli, and Heiner
Stuckenschmidt. Anytime bottom-up rule learning for knowledge graph comple-
tion. In IJCAI, pages 3137–3143, 2019.
[29] Christian Meilicke, Manuel Fink, Yanjie Wang, Daniel Ruffinelli, Rainer Gemulla,
and Heiner Stuckenschmidt. Fine-grained evaluation of rule-and embedding-based
systems for knowledge graph completion. In International Semantic Web Confer-
ence, pages 3–20. Springer, 2018.
[30] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation
of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[31] Seyedeh Fatemeh Mousavi, Mehran Safayani, Abdolreza Mirzaei, and Hoda Ba-
honar. Hierarchical graph embedding in vector space by graph pyramid. Pattern
Recognition, 61:245–254, 2017.
[32] Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. Learning
attention-based embeddings for relation prediction in knowledge graphs. arXiv
preprint arXiv:1906.01195, 2019.
[33] Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Phung. A novel
embedding model for knowledge base completion based on convolutional neural
network. arXiv preprint arXiv:1712.02121, 2017.
[34] Ministry of Health of Vietnam. Memory and Attention, 2020. (Accessed on August
26, 2020).
[35] Stefano Ortona, Venkata Vamsikrishna Meduri, and Paolo Papotti. Robust dis-
covery of positive and negative rules in knowledge bases. In 2018 IEEE 34th
International Conference on Data Engineering (ICDE), pages 1168–1179. IEEE,
2018.
[36] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of
social representations. In Proceedings of the 20th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 701–710, 2014.
[37] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Lev-
skaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv
preprint arXiv:1906.05909, 2019.
[38] Andrea Rossi, Donatella Firmani, Antonio Matinata, Paolo Merialdo, and Denil-
son Barbosa. Knowledge graph embedding for link prediction: A comparative
analysis. arXiv preprint arXiv:2002.00819, 2020.
[39] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between
capsules. In Advances in neural information processing systems, pages 3856–3866,
2017.
[40] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared
Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter
language models using gpu model parallelism. arXiv preprint arXiv:1909.08053,
2019.
[41] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei.
Line: Large-scale information network embedding. In Proceedings of the 24th
international conference on world wide web, pages 1067–1077, 2015.
[42] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale atten-
tion for semantic segmentation. arXiv preprint arXiv:2005.10821, 2020.
[43] Kristina Toutanova and Danqi Chen. Observed versus latent features for knowl-
edge base and text inference. In Proceedings of the 3rd Workshop on Continuous
Vector Space Models and their Compositionality, pages 57–66, 2015.
[44] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume
Bouchard. Complex embeddings for simple link prediction. In Proceedings of the
33rd International Conference on Machine Learning (ICML), 2016.
[45] Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. The
anatomy of the facebook social graph. arXiv preprint arXiv:1111.4503, 2011.
[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.
In Advances in neural information processing systems, pages 5998–6008, 2017.
[48] Thanh Vu, Tu Dinh Nguyen, Dat Quoc Nguyen, Dinh Phung, et al. A capsule
network-based embedding model for knowledge graph completion and search per-
sonalization. In Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), pages 2180–2189, 2019.
[49] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding.
In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 1225–1234, 2016.
[50] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov,
and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language under-
standing. In Advances in neural information processing systems, pages 5753–5763,
2019.
Appendix A. OPTIMAL HYPERPARAMETERS
In this section, we present the set of optimal hyperparameters used for both models:
the attention-based model and the ConvKB model. The hyperparameter optimization
process was conducted using grid search based on the Hits@10 evaluation metric. For
the attention-based model, we trained on the entire dataset without any data splitting.
In contrast, for the ConvKB model, we applied the same hyperparameter configuration
across all datasets. The detailed hyperparameters are shown in the following table: