
UNIVERSITY OF SCIENCE

FACULTY OF INFORMATION TECHNOLOGY

Hoang Minh Thanh - Phan Minh Tam

GRADUATION THESIS REPORT ON LINK PREDICTION IN KNOWLEDGE GRAPHS

BACHELOR GRADUATION THESIS


STANDARD PROGRAM

Ho Chi Minh City, October 2020


UNIVERSITY OF SCIENCE
FACULTY OF INFORMATION TECHNOLOGY

Hoang Minh Thanh - 18424062


Phan Minh Tam - 18424059

GRADUATION THESIS REPORT ON LINK PREDICTION IN KNOWLEDGE GRAPHS

BACHELOR GRADUATION THESIS


STANDARD PROGRAM

SUPERVISOR
MSc. Le Ngoc Thanh
Department of Computer Science

Ho Chi Minh City, October 2020

DECLARATION

We hereby declare that this thesis is our own research work. The data and research
results presented in it are truthful and accurate, and have not been duplicated from
any other work.

Ho Chi Minh City, day ... month ... year ...


CANDIDATE
(Signature and full name)

ACKNOWLEDGEMENTS

We would like to express our sincere gratitude to MSc Le Ngoc Thanh for his
dedicated guidance, for sharing his knowledge and experience, and for providing us
with valuable solutions throughout the completion of this graduation thesis.
We would also like to extend our thanks to the faculty members of the Faculty
of Information Technology, University of Science - Vietnam National University, Ho
Chi Minh City, who have imparted invaluable knowledge to us during our academic
journey.
Our gratitude also goes to the scientists and researchers whose work we have cited
and built upon to complete our thesis.
Lastly, we would like to thank our families, friends, and others who have always
supported and encouraged us during the process of working on this thesis.
Once again, our sincere thanks!

Table of Contents

Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
List of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
List of Symbols and Abbreviations . . . . . . . . . . . . . . . . . . . . . . . 8
Glossary of Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

Abstract 1

Chapter 1 INTRODUCTION 1

Chapter 2 RELATED WORK 3


2.1 Definition of Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Link Prediction in Knowledge Graphs . . . . . . . . . . . . . . . . . . . 4
2.3 Research Areas in Knowledge Graphs . . . . . . . . . . . . . . . . . . . 6

Chapter 3 RULE-BASED METHOD 9


3.1 Horn Clauses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Definition of Logical Graph Language . . . . . . . . . . . . . . . . . . . 10
3.3 AnyBURL Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3.1 AnyBURL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.2 Generate Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Extended AnyBURL Algorithm . . . . . . . . . . . . . . . . . . . . . . 15
3.4.1 Algorithm 3: Offline-to-Online Learning . . . . . . . . . . . . . 15
3.4.2 Algorithm 4: Online-to-Online Learning . . . . . . . . . . . . . 16

Chapter 4 DEEP LEARNING-BASED METHODS 17


4.1 Graph Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.1 Graph Embedding Problem Settings . . . . . . . . . . . . . . . 20

4.1.2 Graph Embedding Techniques . . . . . . . . . . . . . . . . . . . 22
4.2 Multi-head Attention Mechanism . . . . . . . . . . . . . . . . . . . . . 29
4.2.1 Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.2 Multi-Head Attention . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Graph Attention Network . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 KBGAT Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4.1 Embedding Initialization . . . . . . . . . . . . . . . . . . . . . . 35
4.4.2 Encoder Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4.3 ConvKB Prediction Model . . . . . . . . . . . . . . . . . . . . . 42

Chapter 5 EXPERIMENTS 44
5.1 Training Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.1 FB15k Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.2 FB15k-237 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1.3 WN18 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1.4 WN18RR Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Training Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3.1 Training with the KBGAT Model . . . . . . . . . . . . . . . . . 48
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

Chapter 6 CONCLUSION 52

REFERENCES 54

Appendix A Optimal Hyperparameters 60

LIST OF FIGURES

Figure 2.1: Example of an input graph . . . . . . . . . . . . . . . . . . . . . 3


Figure 2.2: A taxonomy of research areas in knowledge graphs . . . . . . . . 6

Figure 3.1: Example of a knowledge graph . . . . . . . . . . . . . . . . . . . 11

Figure 4.1: Graph Embedding Techniques . . . . . . . . . . . . . . . . . . . 19


Figure 4.2: Node embedding with each vector representing node features . . 20
Figure 4.3: Edge embedding with each vector representing edge features . . 21
Figure 4.4: Embedding a graph substructure . . . . . . . . . . . . . . . . . . 21
Figure 4.5: Whole-graph embedding . . . . . . . . . . . . . . . . . . . . . . 22
Figure 4.6: Graph Embedding Techniques . . . . . . . . . . . . . . . . . . . 23
Figure 4.7: Knowledge graph and normalized attention coefficients of the entity 32
Figure 4.8: Illustration of embedding vectors in the TransE model . . . . . . 36
Figure 4.9: TransE embedding model . . . . . . . . . . . . . . . . . . . . . . 38
Figure 4.10: Illustration of the encoder layers in the KBGAT model . . . . . 39
Figure 4.11: Illustration of the decoder layers of the ConvKB model with 3 filters . . 42

LIST OF TABLES

Table 4.1: Comparison of Advantages and Disadvantages of Graph Embedding Techniques . . . 28

Table 5.1: Dataset Information . . . . . . . . . . . . . . . . . . . . . . . . . 44


Table 5.2: Experimental results on the FB15k and FB15k-237 datasets . . . 49
Table 5.3: Experimental results on the WN18 and WN18RR datasets . . . 50
Table 5.4: Accuracy results of the two new knowledge addition strategies . 50
Table 5.5: Evaluation results on the number of rules of the two new knowl-
edge addition strategies . . . . . . . . . . . . . . . . . . . . . . . 51

LIST OF ALGORITHMS

Algorithm 1: Anytime Bottom-up Rule Learning . . . . . . . . . . . . . . 13


Algorithm 2: Generate Rules(p) . . . . . . . . . . . . . . . . . . . . . . . . 14
Algorithm 3: BatchAnyBURL Learning batch size . . . . . . . . . . . . . 15
Algorithm 4: EdgeAnyBURL . . . . . . . . . . . . . . . . . . . . . . . . . 16
Algorithm 5: TransE Embedding Learning Algorithm [5] . . . . . . . . . . 37

LIST OF SYMBOLS AND ABBREVIATIONS

List of Symbols

Symbol             Description
G                  Graph
Gmono              Homogeneous graph
Ghete              Heterogeneous graph
Gknow              Knowledge graph
V, E               Set of vertices, set of edges
e, ei              Entity, the i-th entity
r, rk              Relation, the k-th relation
t_ijk, t^k_ij      An edge/triple
e⃗, r⃗               Entity embedding, relation embedding
⟨h, r, t⟩          A triple of head entity, relation, tail entity
T^v, T^e           Set of vertex types, set of edge types
Ne, Nr             Number of entities, number of relations
Nhead              Number of self-attention heads
ℝ                  Set of real numbers
E, R               Entity embedding matrix, relation embedding matrix
S                  Training dataset
∗                  Convolution operation
σ                  Non-linear activation function
W                  Weight matrix
∥_{k=1}^{K}        Concatenation from layer 1 to K
||                 Concatenation
(·)^T              Transpose
∥W∥_2^2            Squared L2 norm
∨, ∧               Disjunction (OR), conjunction (AND)
⊕                  Binary operation
∩                  Intersection
¬                  Negation
⋀_{i=1}^{n}        Chain conjunction
γ                  Margin
µ                  Learning rate
ω                  Number of convolutional layers
Ω                  Convolutional kernel/filter set

GLOSSARY OF TERMS

Term                             Full Meaning

AnyBURL                          Anytime Bottom-up Rule Learning – an algorithm for
                                 learning bottom-up rules at any time
Knowledge Graph (Gknow)          A graph of knowledge – a collection of ground atoms or
                                 facts
ConvE / ConvKB                   CNN-based models for graph embeddings – variants using
                                 embedding vectors for ⟨h, r, t⟩
Graph Attention Network (GAT)    Graph neural network using attention mechanisms to
                                 aggregate neighborhood information
KBGAT                            GAT with relation embeddings – a GAT variant incorporating
                                 relation embeddings
Saturation (SAT)                 Saturation level – the ratio of newly learned rules to
                                 existing rules

Abstract
A knowledge graph is a structure used to represent real-world informa-
tion, which has been successfully researched and developed by Google
for its search engine [15]. The exploitation of knowledge graphs not
only involves querying and analysis, but also completing missing in-
formation and predicting links based on the available data within the
graph. Therefore, in this report, we present an overview of knowl-
edge graphs and two methods for link prediction in graphs: rule-based
methods and deep learning-based methods.
For the rule-based method, we rely on the AnyBURL model and
propose two additional strategies for inserting new knowledge into the
graph.
For the deep learning-based method, we review the attention mech-
anism in natural language processing, which is then applied to knowl-
edge graphs, and present a fully detailed improved model called KB-
GAT. By stacking layers, nodes can attend to their neighboring fea-
tures without incurring additional computational cost or relying on
prior knowledge of the graph structure.
Our two models achieved significantly better results compared to
other link prediction methods applied on four standard datasets.
Chapter 1. INTRODUCTION

Nowadays, graphs have been applied in all aspects of life. Social network graphs (e.g.,
Facebook [45]) illustrate how individuals are connected to each other, the places we
visit, and the information we interact with. Graphs are also used as core structures
in video recommendation systems (e.g., YouTube [3]), flight networks, GPS naviga-
tion systems, scientific computations, and even brain connectivity analysis. Google’s
Knowledge Graph [15], introduced in 2012 [22], is a notable example of how information
can be structured and utilized in knowledge graphs.
Effectively exploiting knowledge graphs provides users with deeper insight into the
underlying data, which can benefit many real-world applications. However, in prac-
tice, new knowledge is continuously generated, and the acquired information is often
incomplete or missing. This leads to the problem of knowledge graph completion or
link prediction in knowledge graphs.
Most current approaches aim to predict a new edge connecting two existing nodes.
Such methods help make the graph more complete—i.e., denser—by introducing ad-
ditional connecting edges. However, these approaches primarily address the problem
of completion rather than the challenge of integrating new knowledge into the graph,
which remains an open question. Currently, research in knowledge graph completion
follows two main directions: one is optimizing an objective function to make predic-
tions with minimal error, as in RuDiK [35], AMIE [14], and RuleN [29], which are
typically used in vertex or edge classification applications. The other approach gener-
ates a ranked list of k candidate triples, where the score reflects decreasing confidence,
as seen in studies such as TransE [5] and ConvKB [48], which are commonly used in
recommendation systems. Our approach follows this second direction of producing a
candidate list.
Within these approaches, there are two main methodologies: rule-based systems
such as AnyBURL [27], and embedding-based methods such as ConvE [11], TransE
[5], and ComplEx [44]. With the goal of gaining a systematic understanding of these
methods, we chose to explore both directions in this thesis. For the rule-based ap-
proach, we selected AnyBURL [27], and for the graph embedding-based method, we
chose KBGAT [32], which employs attention mechanisms.
Our contribution in the AnyBURL method includes a Python implementation 1 ,
along with two proposed strategies for adding new knowledge to the graph, which
we term offline-to-online and online-to-online. The offline-to-online strategy extends
AnyBURL by generating rules when a batch (set) of new knowledge is added. The
online-to-online strategy generates rules immediately when a single new piece of knowl-
edge (edge) is added.
For the embedding-based method, we present a review of attention mechanisms [46],
their application in knowledge graphs via Graph Attention Networks (GATs) [47], and
the KBGAT model [32].
Our contribution in the deep learning approach includes a publicly available im-
plementation and training process on GitHub 2 , with both training code and model
results openly provided.

1 https://github.com/MinhTamPhan/mythesis

2 https://github.com/hmthanh/GCAT

Chapter 2. RELATED WORK

In this section, we present the basic definitions of knowledge graphs in order to under-
stand the task of link prediction in knowledge graphs, as well as other related research
directions.

2.1 Definition of Knowledge Graphs


The basic definitions of knowledge graphs are compiled and categorized by Cai, Hongyun
[7] and Goyal, Palash [16] as follows:

Figure 2.1: Example of an input graph

• Definition 1 (Graph) G = (V, E), where v ∈ V is a vertex and e ∈ E is an
edge. G is associated with a vertex-type mapping function fv : V → T^v and an
edge-type mapping function fe : E → T^e.

Here, T^v and T^e are the sets of vertex types and edge types, respectively. Each
vertex vi ∈ V belongs to a specific type, i.e., fv(vi) ∈ T^v. Similarly, for eij ∈ E,
fe(eij) ∈ T^e.

• Definition 2 (Homogeneous Graph) A homogeneous graph Ghomo = (V, E)
is a graph where |T^v| = |T^e| = 1. All vertices in G belong to a single type, and
all edges belong to a single type.

• Definition 3 (Heterogeneous Graph) A heterogeneous graph Ghete = (V, E)
is a graph where |T^v| > 1 or |T^e| > 1. That is, there is more than one type of
vertex or more than one type of edge.

• Definition 4 (Knowledge Graph) A knowledge graph Gknow = (V, R, E) is a
directed graph, where the vertex set represents entities, the relation set R represents
relationships between entities, and the edge set E ⊆ V × R × V represents events
in the form of subject-property-object triples. Each edge is a triple
(entity_head, relation, entity_tail), denoted as ⟨h, r, t⟩, expressing a relation r from
head entity h to tail entity t.

Here, h, t ∈ V are entities and r ∈ R is a relation. We refer to ⟨h, r, t⟩ as a
knowledge graph triple.

Example: in Figure 4.7, there are two triples: ⟨Tom Cruise, born in, New York⟩
and ⟨New York, state of, U.S⟩. Note that entities and relations in a knowledge
graph often belong to different types. Therefore, a knowledge graph can be viewed
as a specific case of a heterogeneous graph.
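To make Definition 4 concrete, the following minimal Python sketch (our own illustration, not code from the thesis implementation) stores a tiny knowledge graph as a set of ⟨h, r, t⟩ triples and answers a simple query over it; the query helper and the triple set are hypothetical.

# A tiny knowledge graph stored as a set of (head, relation, tail) triples.
# The entities and relations below are taken from the example above, for illustration only.
triples = {
    ("Tom Cruise", "born_in", "New York"),
    ("New York", "state_of", "U.S"),
}

# Entities are the constants appearing as heads or tails; relations label the edges.
entities = {h for h, _, _ in triples} | {t for _, _, t in triples}
relations = {r for _, r, _ in triples}

# A simple query: which tails are connected to a given head by a given relation?
def tails(head, relation):
    return {t for h, r, t in triples if h == head and r == relation}

print(entities)                         # the three entities of the example graph
print(tails("Tom Cruise", "born_in"))   # {'New York'}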

2.2 Link Prediction in Knowledge Graphs


Link prediction, also known as knowledge graph completion, is the task of exploiting
known facts (events) in a knowledge graph to infer missing ones. This is equivalent to
predicting the correct tail entity in a triple ⟨h, r, ?⟩ (tail prediction) or the correct head
entity in ⟨?, r, t⟩ (head prediction). For simplicity, instead of distinguishing between
head and tail prediction, we generally refer to the known entity as the source entity
and the entity to be predicted as the target entity.
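Since the methods discussed below ultimately answer queries of the form ⟨h, r, ?⟩ by ranking candidate entities, the following hedged sketch may help fix the idea; the scoring function here is only a placeholder (any of the models discussed later, such as a rule confidence or an embedding score, could supply it).

# Tail prediction as ranking: given (h, r), score every candidate entity and sort.
# `score` is a stand-in for a real model (rule confidence, TransE, ConvKB, ...).
def predict_tails(h, r, entities, score, k=10):
    ranked = sorted(entities, key=lambda t: score(h, r, t), reverse=True)
    return ranked[:k]  # top-k candidate triples <h, r, t>, most confident first

# Example with a dummy scorer that prefers a fixed answer (illustration only).
entities = ["New York", "U.S", "Tom Cruise"]
dummy_score = lambda h, r, t: 1.0 if t == "New York" else 0.0
print(predict_tails("Tom Cruise", "born_in", entities, dummy_score, k=2))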
Most current research on link prediction in knowledge graphs is related to ap-
proaches that focus on embedding a given graph into a low-dimensional vector space.
In contrast to these approaches is a rule-based method explored in [27]. Its core algo-
rithm samples arbitrary rules and generalizes them into Horn clauses [25], then uses
statistics to compute the confidence of these generalized rules. When predicting a new
link (edge) in the graph, the task is to infer whether an edge with a specific label ex-
ists between two given nodes. Many methods have been proposed to learn rules from
graphs, such as in RuDiK [35], AMIE [14], and RuleN [29].
As mentioned earlier, there are two main approaches to this problem: one is op-
timizing an objective function to find a small set of rules that cover the majority of
correct examples with minimal error, as explored in RuDiK [35]. The other approach,
which we adopt in this thesis, aims to explore all possible rules and then generate a
top-k ranking of candidate triples, each associated with a confidence score measured
on the training set.
Our rule-based method is largely based on the Anytime Bottom-Up Rule Learning
for Knowledge Graph Completion method [28], hereafter referred to as AnyBURL.
As its name suggests, this method primarily focuses on completing the graph by filling
in missing parts. A key limitation of this model is that when a new edge or fact is
added to the graph, the entire model must be retrained. We address this issue using
two strategies: the offline-to-online strategy, which retrains a portion of the graph
only after a batch of new edges is added; and the online-to-online strategy, which
immediately retrains the affected portion of the graph whenever a new edge is added.
In the deep learning branch of approaches, many successful techniques from image
processing and natural language processing have been applied to knowledge graphs,
such as Convolutional Neural Networks (CNNs [24]), Recurrent Neural Networks (RNNs
[20]), and more recently, Transformers [50] and Capsule Neural Networks (CapsNets
[39]). In addition, other techniques such as random walks and hierarchical structure-
based models have also been explored. The common advantage of these deep learning
methods on knowledge graphs is their ability to automatically extract features and
generalize complex graph structures based on large amounts of training data. How-
ever, some methods focus mainly on grid-like structures and fail to preserve the spatial
characteristics of knowledge graphs.
The attention mechanism, particularly the multi-head attention layer, has been
applied to graphs through the Graph Attention Network (GAT [47]) model, which
aggregates information about an entity based on attention weights from its neighboring
entities. However, GAT lacks integration of relation embeddings and the embeddings of
an entity’s neighbors—components that are crucial for capturing the role of each entity.
This limitation has been addressed in the work Learning Attention-based Embeddings
for Relation Prediction in Knowledge Graphs (KBGAT [32]), which we adopt as the
foundation for our study.
The attention mechanism is currently one of the most effective (state-of-the-art)
deep learning structures, as it has been proven to substitute any convolution operation
[9]. Moreover, it serves as a core component in leading models for natural language
processing, such as Megatron-LM [40], and image segmentation, such as HRNet-OCR
(Hierarchical Multi-Scale Attention [42]). Some recent works [10] have proposed inter-
esting improvements based on the attention mechanism. However, these advancements
have not yet been applied to knowledge graphs, which motivates us to adopt this family
of methods to integrate the latest innovations into knowledge graph modeling.

2.3 Research Areas in Knowledge Graphs


Figure 2.2: A taxonomy of research areas in knowledge graphs

Knowledge representation has a long-standing history in logic and artificial intel-
ligence. In the context of knowledge graphs, four major research areas have been
categorized and summarized in the survey [22], including: Knowledge Representation
Learning, Knowledge Acquisition, Temporal Knowledge Graphs, and Knowledge-aware
Applications. All research categories are illustrated in Figure 2.2.
Knowledge Representation Learning
Knowledge representation learning is an essential research topic in knowledge graphs
that enables a wide range of real-world applications. It is categorized into four sub-
groups:

• Representation Space focuses on how entities and relations are represented in
vector space. This includes point-wise, manifold, complex vector space, Gaussian
distribution, and discrete space embeddings.

• Scoring Function studies how to measure the validity of a triple in practice. These
scoring functions may be distance-based or similarity-based.

• Encoding Models investigate how to represent and learn interactions among rela-
tions. This is currently the main research direction, including linear or non-linear
models, matrix factorization, or neural network-based approaches.

• Auxiliary Information explores how to incorporate additional information into
embedding models, such as textual, visual, and type information.

Knowledge Acquisition
Knowledge acquisition focuses on how to extract or obtain knowledge based on
knowledge graphs, including knowledge graph completion, relation extraction, and en-
tity discovery. Relation extraction and entity discovery aim to extract new knowledge
(relations or entities) into the graph from text. Knowledge graph completion refers
to expanding an existing graph by inferring missing links. Research directions include
embedding-based ranking, relation path reasoning, rule-based reasoning, and hyper-
relational learning.
Entity discovery tasks include entity recognition, disambiguation, typing, and rank-
ing. Relation extraction models often employ attention mechanisms, graph convolu-
tional networks (GCNs), adversarial training, reinforcement learning (RL), deep learn-
ing, and transfer learning, which is the foundation of the method proposed in our
work.
In addition, other major research directions in knowledge graphs include tempo-
ral knowledge graphs and knowledge-aware applications. Temporal knowledge
graphs incorporate temporal information into the graph to learn temporal representa-
tions. Knowledge-aware applications include natural language understanding, question
answering, recommendation systems, and many other real-world tasks where integrat-
ing knowledge improves representation learning.

Chapter 3. RULE-BASED METHOD

In this chapter, we describe how the problem is reformulated using the rule-based
approach AnyBURL, including the rule (path) sampling algorithm and the rule gen-
eralization algorithm used to store learned knowledge in the model. We also present
our improvements to the training process when new knowledge (edges) is added to the
graph.

3.1 Horn Clauses


In mathematical logic, an atomic formula [19], also simply called an atom, is a for-
mula that contains no logical connectives such as conjunction (∧), disjunction (∨), or
biconditional (⇔). It is a formula with no proper subformulas—meaning that an atom
cannot be decomposed into smaller atoms. Thus, atomic formulas are the simplest ex-
pressions used to construct logical rules. Compound formulas are formed by combining
atomic formulas using logical connectives.
A literal [6] is either an atomic formula or the negation of one. This concept
primarily arises in classical logic theory. Literals are classified into two types: A
positive literal is simply an atomic formula (e.g., x). A negative literal is the
negation of an atomic formula (e.g., ¬x). Whether a literal is considered positive or
negative depends on its defined form.
A clause is either a single literal or a disjunction of two or more literals. In Horn
form, a clause contains at most one positive literal. Note: not all propositional logic
formulas can be converted into Horn form. A clause consisting of exactly one literal is
referred to as a unit clause, and a unit clause without variables is often called a fact
[25]. An atomic formula is referred to as ground (a ground atom) if it contains no
variables. All possible ground atoms that can be formed from a set of function and
predicate symbols make up the Herbrand base for those symbols [1].
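As a worked example of the definitions above (our own illustration, using a rule that appears later in this chapter), a Horn clause with exactly one positive literal can be read as a rule whose body is the conjunction of the negated literals:

\underbrace{\neg\,\mathrm{lives}(X, Y) \;\vee\; \mathrm{speaks}(X, Y)}_{\text{at most one positive literal}}
\;\equiv\;
\mathrm{speaks}(X, Y) \leftarrow \mathrm{lives}(X, Y)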

3.2 Definition of Logical Graph Language
Unlike general definitions of knowledge graphs commonly used in graph embedding
methods, our rule-based approach treats the graph as a formal language. Below are
the formal language definitions of the knowledge graph.
A knowledge graph Gknow is defined over a vocabulary ⟨C, R⟩, where C is the set of
constants and R is the set of binary predicates. Then Gknow ⊆ {r(a, b) | r ∈ R, a, b ∈
C} is a set of ground atoms, or facts. A binary predicate is referred to as a relation,
and a constant (or referenced constant) is referred to as an entity, corresponding to a
data entry in the training set. In what follows, we use lowercase letters for constants
and uppercase letters for variables. This is because we do not learn arbitrary Horn
rules; instead, we focus only on those rule types that can be generalized as discussed
below.
We define a rule as h(c0 , cn ) ← b1 (c0 , c1 ), . . . , bn (cn , cn+1 ), which is a path of ground
atoms of length n. Here, h(. . . ) is referred to as the head atom, and b1 (c0 , c1 ), . . . , bn (cn , cn+1 )
are referred to as the body atoms. We distinguish the following three types of rules:
- Binary rules (B): rules in which the head atom contains two variables.
- Unary rules ending in a dangling node (Ud): rules where the head atom contains only
  one variable, and the rule ends in a body atom that contains only variables (no constants).
- Unary rules ending in a constant (Uc): rules where the head atom also contains only
  one variable, but the rule ends with an atom that may link to an arbitrary constant.
  If that constant matches the constant in the head atom, the rule forms a cyclic path.

B  : h(A_0, A_n) ← ⋀_{i=1}^{n} b_i(A_{i−1}, A_i)

U_d : h(A_0, c) ← ⋀_{i=1}^{n} b_i(A_{i−1}, A_i)                                   (3.1)

U_c : h(A_0, c) ← ⋀_{i=1}^{n−1} b_i(A_{i−1}, A_i) ∧ b_n(A_{n−1}, c′)
We refer to rules of these types as path rules because the body atoms (the part
after the ← symbol) form a path. Note that this also includes variants of rules where
the variables are reversed within the atoms. Given a knowledge graph Gknow , a path
of length n is a sequence of n triples pi (ci , ci+1 ) where either pi (ci , ci+1 ) ∈ Gknow or
pi (ci+1 , ci ) ∈ Gknow , with 0 ≤ i ≤ n. The abstract rule patterns presented above are
considered to have length n, as their body atoms can be instantiated into a path of
length n − 1. For example, in Figure 3.1, when sampling paths of length 3, we can
obtain the following two rules: the rule marked in green and the rule marked in red.

(green) speaks(ed, d) ← married(ed, lisa) ∧ born(lisa, a)


(red) speaks(ed, d) ← lives(ed, nl) ∧ lang(nl, d)

Source: adapted from Anytime Bottom-Up Rule Learning (http://web.informatik.uni-mannheim.de/AnyBURL/2019-05/meilicke19anyburl.pdf)

Figure 3.1: Example of a knowledge graph

In addition, rules of type B and Uc are also referred to as closed-path rules. These
are utilized by the AMIE model, described in [13, 14]. Rule Ud is considered an open
rule or an acyclic path rule, since An is a variable that appears only once. For example:

speaks(X, Y) ← lives(X, Y)                              (1)
lives_in_city(X, Y) ← lives(X, A), within(Y, A)         (2)
gen(X, female) ← married(X, A), gen(A, male)            (3)
profession(X, actor) ← acted_in(X, A)                   (4)
Rule (1) is a B-type rule (binary rule). This rule states that a person (entity)
X speaks language Y if person X lives in country Y . Clearly, this is a general rule:
whenever entity X has an edge to entity Y labeled lives, we can infer the existence of
another edge labeled speaks between X and Y .
Rules (2) and (3) are both Uc-type rules. Rule (2) states that a person X lives in
city Y if person X lives in country A and city Y is located in country A. Rule (3)
states that a person X is female if they are married to person A and person A is male.
In rule (3), there is no cycle formed in the graph, unlike in rule (2), where node Y is
repeated both in the head atom and as the final node in the body atoms.
Rule (4) is a Ud -type rule, which states that a person X is an actor if they acted
in a movie A.
All rules under consideration are filtered based on a score called the rule’s confi-
dence, which is computed on the training dataset. This confidence score is defined as
the number of body atom paths that lead to the head atom, divided by the total number
of paths that contain only the body atoms.
For example, consider the following rule: gen(X, female) ← married(X, A), gen(A, male).
We first count all entity pairs that satisfy the relations married(X, A), gen(A, male)—this
is the number of paths containing the body atoms. Then we count how many of those
pairs also satisfy the inferred relation gen(X, f emale); this is the number of body atom
paths that lead to the head atom. The confidence score of the rule is the ratio of the
latter to the former.
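The confidence computation described above can be sketched in a few lines of Python (a simplification we add for illustration: the counts are taken over groundings of X, and the actual AnyBURL algorithm estimates them by sampling rather than full enumeration; the example facts are made up).

# Confidence of the rule  gen(X, female) <- married(X, A), gen(A, male):
# (# of X for which body AND head hold) / (# of X for which the body holds).
def rule_confidence(triples):
    married = {(h, t) for h, r, t in triples if r == "married"}
    gender  = {(h, t) for h, r, t in triples if r == "gen"}

    body_x = {x for (x, a) in married if (a, "male") in gender}   # body holds
    head_x = {x for x in body_x if (x, "female") in gender}       # body and head hold
    return len(head_x) / len(body_x) if body_x else 0.0

facts = {
    ("anna", "married", "bob"), ("bob", "gen", "male"), ("anna", "gen", "female"),
    ("carl", "married", "dave"), ("dave", "gen", "male"),  # body holds, head does not
}
print(rule_confidence(facts))  # 0.5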

3.3 AnyBURL Algorithm


In this section, we describe the core algorithm of the AnyBURL method, as originally
introduced in [27], as well as our two extended algorithms designed to handle situa-
tions where the graph is incrementally updated with one or more new facts (edges).
Additionally, we briefly describe how rules are initialized and how rule confidence is
computed using sampling on the training set, including the issue of confidence estima-
tion during prediction when sampling is used.

3.3.1 AnyBURL

Algorithm 1 Anytime Bottom-up Rule Learning


1: procedure AnyBURL(Gknow , s, sat, Q, ts)

2: n=2
3: R=∅
4: loop
5: Rs = ∅
6: start = currentT ime()
7: repeat
8: p = sampleP ath(Gknow , n)
9: Rp = generateRules(p)
10: for r ∈ Rp do
11: score(r, s)
12: if Q(r) then
13: Rs = Rs ∪ {r}
14: until currentT ime() > start + ts
15: Rs′ = Rs ∩ R
16: if | Rs′ | / | Rs |> sat then
17: n=n+1
18: R = R ∪ Rs
return R

The input of the algorithm consists of Gknow, s, sat, Q, and ts. The output is the set R
of learned rules. Here, Gknow is a knowledge graph derived from the training dataset.
s is a parameter indicating the sample size used during each sampling iteration on
the training data for confidence computation. sat is the saturation threshold: after
each round we compute the saturation, i.e., the fraction of the rules sampled in the
current round that were already present in the learned rule set. If this fraction is
below the threshold, we consider that there is still potential to discover new rules of
length n; otherwise, we increase the rule length and continue the rule mining process.
Q is a quality criterion (threshold) used to decide whether a newly generated rule
should be added to the result set. ts indicates the duration of each mining round
(time span).

We start with n = 2, which corresponds to sampling paths of length 2, since a valid
path rule requires at least a head atom and one body atom. In
the rule sampling step (samplePath), we simply select a random node in the graph,
traverse all possible paths from that node with length n, and then randomly select one
of the traversed paths.
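A minimal version of the samplePath step might look as follows (our own sketch, not the thesis code, simplified to a plain random walk): pick a random starting node and take n random steps, allowing each edge to be traversed in either direction, as in the path definition of Section 3.2.

import random

# Sample a path of n steps from a knowledge graph given as (head, relation, tail) triples.
# Edges may be walked in either direction; returns None if the walk gets stuck.
def sample_path(triples, n):
    nodes = {h for h, _, _ in triples} | {t for _, _, t in triples}
    current = random.choice(sorted(nodes))
    path = []
    for _ in range(n):
        # Outgoing and incoming edges of the current node.
        steps = [(r, t) for h, r, t in triples if h == current]
        steps += [(r, h) for h, r, t in triples if t == current]
        if not steps:
            return None
        relation, nxt = random.choice(steps)
        path.append((current, relation, nxt))
        current = nxt
    return path

facts = {("ed", "married", "lisa"), ("lisa", "born", "a"), ("ed", "lives", "nl")}
print(sample_path(facts, 2))  # e.g. [('ed', 'married', 'lisa'), ('lisa', 'born', 'a')]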

3.3.2 Generate Rules

Algorithm 2 Generate Rules(p)


1: procedure generate rules(p)

2: generalizations = ∅
3: is binary rule = random.choice([true, false])
4: if is binary rule then
5: replace all head by variables(p)
6: replace all tail by variables(p)
7: add(generalizations, p)
8: else:
9: replace all head by variables(p)
10: add(generalizations, p)
11: replace all tail by variables(p)
12: add(generalizations, p)
return generalizations

In this algorithm, we substitute constants into the head and tail of all path rules
from the sampled rule in the previous step if the rule to be learned is a binary rule.
Otherwise, we substitute either the head or the tail and then add the rule to the return
set. We then sample a set of rules from the training set and compute their confidence
scores as described in Section 3.2. To reduce computational cost, we choose to
sample from the training set for this calculation.
When making predictions for rule candidates, we recompute confidence by incorpo-
rating an estimated number of incorrect rules not observed during sampling. For our
model, after experimenting with the parameter in the range [5, 10], we found that this
yields the best results.
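The generalization step of Algorithm 2 can be illustrated with a short sketch (our own simplification, not the thesis code): constants in the sampled ground path are replaced by variables, either at both ends of the head atom (binary rule) or at only one end (unary rules). The ground rule below is the red rule from Figure 3.1.

# Generalize a ground path rule by replacing constants with variables.
# A ground rule is a list of (predicate, arg1, arg2) atoms; the first atom is the head.
def generalize(ground_rule, binary=True):
    constants = []
    for _, a, b in ground_rule:
        for c in (a, b):
            if c not in constants:
                constants.append(c)
    # Map constants to variables A0, A1, ...; for a unary rule keep the head's tail constant.
    keep = set() if binary else {ground_rule[0][2]}
    var = {c: f"A{i}" for i, c in enumerate(constants) if c not in keep}
    return [(p, var.get(a, a), var.get(b, b)) for p, a, b in ground_rule]

ground = [("speaks", "ed", "d"), ("lives", "ed", "nl"), ("lang", "nl", "d")]
print(generalize(ground, binary=True))
# [('speaks', 'A0', 'A1'), ('lives', 'A0', 'A2'), ('lang', 'A2', 'A1')]
print(generalize(ground, binary=False))
# [('speaks', 'A0', 'd'), ('lives', 'A0', 'A2'), ('lang', 'A2', 'd')]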

3.4 Extended AnyBURL Algorithm

3.4.1 Algorithm 3: Offline-to-Online Learning

Algorithm 3 BatchAnyBURL Learning batch size


1: procedure BatchAnyBURL(Gknow , sat, Q, ts, batch edge)

2: is connected = add(Gknow , batch edge)


3: if is connected then
4: G′ = Gknow ⊕ batch edge
5: else
6: G′ = batch edge
7: n=2
8: R=∅
9: loop
10: Rs = ∅
11: start = currentT ime()
12: repeat
13: p = sampleP ath(G′ , n)
14: Rp = generateRules(p)
15: for r ∈ Rp do
16: score(r, G′ )
17: if Q(r) then
18: Rs = Rs ∪ {r}
19: until currentT ime() > start + ts
20: Rs′ = Rs ∩ R
21: if | Rs′ | / | Rs |> sat then
22: n=n+1
23: R = R ∪ Rs
return R

This algorithm is our proposed extension to avoid retraining the entire model when a
new set of knowledge is added to the graph. When new knowledge is added, we first
check whether it is connected to the existing knowledge in the graph (i.e., connectivity).
If it is, we perform the ⊕ operation by combining all elements in batch edge with the
connected components in the graph, up to a path length of 5. If there is no connectivity,
we use all elements in batch edge and repeat the steps of the Anytime Bottom-up Rule
Learning algorithm.

3.4.2 Algorithm 4: Online-to-Online Learning

Algorithm 4 EdgeAnyBURL
1: procedure EdgeAnyBURL(Gknow , s, sat, Q, ts, edge)

2: is connected = add(Gknow , edge)


3: R=∅
4: if is connected then
5: n=2
6: Rs = ∅; start = currentTime()
7: repeat
8: p = sampleP ath(edge, n)
9: Rp = generateRules(p)
10: for r ∈ Rp do
11: score(r, s)
12: if Q(r) then
13: Rs = Rs ∪ {r}
14: until currentT ime() > start + ts
15: Rs′ = Rs ∩ R
16: if | Rs′ | / | Rs |> sat then
17: n=n+1
18: R = R ∪ Rs
19: return R

This algorithm is a complementary component to Algorithm 3. We refer to it as


online-to-online because when a new edge (i.e., new knowledge) is added to the graph,
we immediately perform learning on the path rules related to that edge—unlike in
Algorithm 3, where learning is triggered only after a sufficient amount of new knowledge
has been added.

Chapter 4. DEEP LEARNING-BASED METHODS

In this chapter, we present Knowledge Graphs and describe the task of Graph Embed-
ding, providing an overview of current graph embedding techniques. We will revisit the
attention mechanism and explain how it is applied to knowledge graphs through the
Graph Attention Network (GAT) model [47]. Additionally, we present an improved
method based on the graph attention model - KBGAT [32] - which incorporates relation
information and neighboring relations.

4.1 Graph Embedding


In the real world, representing entities and relations as vectors can be intuitively un-
derstood as the process of mapping features and attributes of an object into a lower-
dimensional space, with each component representing a specific unit-level feature.
For example, we know Donald Trump is 1.9 meters tall and has a wife named
Melania. Thus, we could represent the entity "Donald Trump" as a vector:

e⃗_Trump = [1.9_height, 0_area, 1_wife_is_Melania, 0_wife_is_Taylor].

For features that cannot be measured or have no value (e.g., .area), we assign
0. For categorical features without magnitude (e.g., .wife), we represent them using
probabilities of unit features (e.g., .wife is Melania, .wife is Taylor). Therefore,
any object in the real world can be embedded as a vector in an interpretable way.
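The interpretable embedding described above could be built as follows (a toy illustration of the idea only, not how the models in this thesis actually learn embeddings): numeric features keep their value, missing features become 0, and categorical features are expanded into 0/1 unit features. The feature slots are the ones from the example.

# Build an interpretable feature vector for an entity from a feature dictionary.
# Unknown/unmeasurable features default to 0; categorical values become one-hot entries.
feature_slots = ["height", "area", "wife_is_Melania", "wife_is_Taylor"]

def embed(entity_features):
    vec = []
    for slot in feature_slots:
        if "_is_" in slot:                    # categorical unit feature, e.g. wife_is_X
            attr, _, value = slot.partition("_is_")
            vec.append(1.0 if entity_features.get(attr) == value else 0.0)
        else:                                 # numeric feature, 0 if missing
            vec.append(float(entity_features.get(slot, 0.0)))
    return vec

print(embed({"height": 1.9, "wife": "Melania"}))  # [1.9, 0.0, 1.0, 0.0]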
To understand graph embedding techniques, we begin with several fundamental
definitions:

• Definition 5 (First-Order Proximity) The first-order proximity between vertex
vi and vertex vj is the edge weight Ai,j of the edge eij.

Two vertices are more similar if they are connected by an edge with a higher
weight. Thus, the first-order proximity between vi and vj is denoted as s^(1)_ij = Ai,j.
Let s^(1)_i = [s^(1)_i1, s^(1)_i2, . . . , s^(1)_i|V|] represent the first-order proximities
between vi and the other vertices.

Using the graph in Figure 2.1 as an example, the first-order proximity between
v1 and v2 is the weight of edge e12, denoted as s^(1)_12 = 1.2. The vector s^(1)_1
records the edge weights connecting v1 to all other vertices in the graph, i.e.,

s^(1)_1 = [0, 1.2, 1.5, 0, 0, 0, 0, 0, 0].

• Definition 6 (Second-Order Proximity) The second-order proximity s^(2)_ij between
vertices vi and vj is defined as the similarity between vi's first-order neighborhood
vector s^(1)_i and vj's vector s^(1)_j.

For example, in Figure 2.1, the second-order proximity s^(2)_12 is the similarity
between s^(1)_1 and s^(1)_2. As introduced above:

s^(1)_1 = [0, 1.2, 1.5, 0, 0, 0, 0, 0, 0],   s^(1)_2 = [1.2, 0, 0.8, 0, 0, 0, 0, 0, 0].

We compute the cosine similarity (a short numerical check is given after these
definitions):

s^(2)_12 = cos(s^(1)_1, s^(1)_2) = 0.43,   s^(2)_15 = cos(s^(1)_1, s^(1)_5) = 0.

We observe that the second-order proximity between v1 and v5 is 0 because they
share no common 1-hop neighbors. v1 and v2 share a common neighbor v3, thus
their second-order proximity s^(2)_12 is greater than 0.

Higher-order proximities can be defined similarly. For example, the k-th order
proximity between vi and vj is the similarity between s^(k−1)_i and s^(k−1)_j.

• Definition 7 (Graph Embedding) Given a graph input G = (V, E) and a pre-
defined embedding dimension d where d ≪ |V |, the graph embedding problem is to
map G into a d-dimensional space while preserving as much graph property infor-
mation as possible. These properties can be quantified using proximity measures
such as first-order and higher-order proximity. Each graph is represented either
as a d-dimensional vector (for the entire graph) or a set of d-dimensional vectors
where each vector encodes a part of the graph (e.g., node, edge, substructure).
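The numbers in the example above can be reproduced with a few lines of NumPy (a quick check we added, not part of the thesis code): first-order proximity vectors are rows of the weighted adjacency matrix, and second-order proximity is the cosine similarity of two such rows. The vector for v5 is assumed, since the text only states that v5 shares no neighbor with v1.

import numpy as np

# First-order proximity vectors of v1, v2 and v5 from the example (rows of A).
s1 = np.array([0, 1.2, 1.5, 0, 0, 0, 0, 0, 0])
s2 = np.array([1.2, 0, 0.8, 0, 0, 0, 0, 0, 0])
s5 = np.array([0, 0, 0, 0, 0, 1.0, 0, 0, 0])   # assumed: v5 shares no neighbor with v1

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine(s1, s2), 2))  # 0.43  -> second-order proximity s^(2)_12
print(round(cosine(s1, s5), 2))  # 0.0   -> second-order proximity s^(2)_15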

Graph embedding is the process of transforming graph features into vectors or sets of
low-dimensional vectors. The more effective the embedding, the higher the accuracy in
subsequent graph mining and analysis tasks. The biggest challenge in graph embedding
depends on the problem setting, which includes both the embedding input and output,
as illustrated in Figure 4.1.

Figure 4.1: Graph Embedding Techniques

Based on the embedding input, we categorize the surveyed methods in [7] as follows:
Homogeneous Graph, Heterogeneous Graph, Graph with Auxiliary Information, and
Graph Constructed from Non-relational Data.
Different types of embedding inputs preserve different information in the embedding
space and therefore pose different challenges for the graph embedding problem. For
example, when embedding a graph with only structural information, the connections
between nodes are the primary target to preserve. However, for graphs with node labels
or entity attribute information, auxiliary information provides additional context for
the graph and can therefore also be considered during the embedding process. Unlike
embedding input, which is fixed and provided by the dataset, the embedding output
is task-specific.
For instance, the most common embedding output is node embedding, which
represents each node as a vector that reflects similarity between nodes. Node embed-
dings are beneficial for node-related tasks such as node classification, node clustering,
etc.
However, in some cases, the tasks may involve more fine-grained graph components
such as node pairs, subgraphs, or the entire graph. Therefore, the first challenge of
embedding is to determine the appropriate type of embedding output for the application
of interest. Four types of embedding outputs are illustrated in Figures 4.2–4.5, including:
Node Embedding (4.2), Edge Embedding (4.3), Hybrid Embedding (4.4), and
Whole-Graph Embedding (4.5). Different output granularities have distinct criteria
and present different challenges. For example, a good node embedding retains similarity
with its neighbors in the embedding space. Conversely, a good whole-graph embedding
represents the entire graph as a vector that preserves graph-level similarity.

4.1.1 Graph Embedding Problem Settings


While the input is determined by the type of information to be preserved, the output
varies depending on the downstream graph mining task. Therefore, we discuss in more
detail the embedding methods based on the type of output required by the embedding
problem.
Node Embedding

Figure 4.2: Node embedding with each vector representing node features

Node embedding represents each node as a low-dimensional vector. Nodes that are
close in the graph have similar vector representations. The difference among graph em-
bedding methods lies in how they define the closeness between two nodes. First-order
proximity (Definition 5) and second-order proximity (Definition 6) are two commonly
used metrics to measure pairwise node similarity. Higher-order proximity has also been
explored to some extent. For example, capturing k-step (k = 1, 2, 3, ···) neighborhood
relationships during embedding is discussed in the study by Cao, Shaosheng [8].
Edge Embedding

Figure 4.3: Edge embedding with each vector representing edge features

In contrast to node embedding, edge embedding aims to represent an edge as a


low-dimensional vector. Edge embedding is useful in two cases:
First, in knowledge graph embedding. Each edge is a triple ⟨h, r, t⟩ (Definition 4).
The embedding is learned to preserve the relation r between h and t in the embedding
space so that a missing entity or relation can be accurately predicted given the other
two components in ⟨h, r, t⟩.
Second, some methods embed a pair of nodes as a vector to compare it with other
node pairs or to predict the existence of a link between the two nodes. Edge embedding
benefits graph analyses that involve edges (node pairs), such as link prediction, relation
prediction, and knowledge-based entity inference.
Hybrid Embedding

Figure 4.4: Embedding a graph substructure

Hybrid embedding refers to embedding combinations of different graph components,
e.g., node + edge (i.e., substructure), or node + parts. Substructure or part embeddings
can also be derived by aggregating individual node and edge embeddings. However,
such indirect approaches are not optimized to capture the structure of the graph.
Moreover, node and part embeddings can reinforce each other. Node embeddings
improve by learning from high-order neighborhood attention, while part embeddings
become more accurate due to the collective behavior of their constituent nodes.
Whole-Graph Embedding

Figure 4.5: Whole-graph embedding

Whole-graph embedding is typically used for small graphs such as proteins, molecules,
etc. In this case, an entire graph is represented as a vector, and similar graphs are em-
bedded close to each other. Whole-graph embedding is useful for graph classification
tasks by providing a simple and effective solution for computing graph similarity. To
balance embedding time (efficiency) and information retention (expressiveness), hi-
erarchical graph embedding [31] introduces a hierarchical embedding framework.
It argues that accurate understanding of global graph information requires processing
substructures at multiple scales. A graph pyramid is formed where each level is a
coarsened graph at a different scale. The graph is embedded at all levels and then
concatenated into a single vector. Whole-graph embedding requires collecting features
from the entire graph, thus is generally more time-consuming than other settings.

4.1.2 Graph Embedding Techniques


In this section, we categorize graph embedding methods based on the techniques used.
As previously stated, the goal of graph embedding is to represent a graph in a lower-
dimensional space while preserving as much of the original graph information as possible.

Figure 4.6: Graph Embedding Techniques

The main differences among embedding techniques lie in how they define which
intrinsic graph properties should be preserved. Since the primary focus of our work is
on graph embedding methods based on deep learning, we provide only brief overviews
of the other method categories.
Deep Learning
In this section, we present in detail the research directions of deep learning tech-
niques, including: using random walks and not using random walks. Deep learning
techniques are widely used in graph embedding due to their speed and efficiency in au-
tomatically capturing features. Among these deep learning-based methods, three types
of input-based graph settings (excluding graphs constructed from non-relational data)
and all four output types (as shown in Figure 4.1) can adopt deep learning approaches.
Deep learning techniques with random walks
In this category, the second-order proximity (Definition 6) in the graph is preserved
in the embedding space by maximizing the probability of observing a vertex’s neigh-
borhood conditioned on its embedding vector. The graph is represented as a set of
samples obtained via random walks, and then deep learning methods are applied to
ensure the structural properties (i.e., path-based information) are preserved. Repre-
sentative methods in this group include: DeepWalk [36], LINE [41], Node2Vec [17],
Anonymous Walk [21], NetGAN [4], etc.
Deep learning techniques without random walks
This approach applies multi-layer learning structures effectively and efficiently to
transform the graph into a lower-dimensional space. It operates over the entire graph.

Several popular methods have been surveyed and presented in [38], as follows:

• Convolutional Neural Networks (CNNs)

This model uses multiple convolutional layers: each layer performs convolution
on the input data using low-dimensional filters. The result is a feature map,
which then passes through a fully connected layer to compute the probability
values. For example, ConvE [11]: each entity and relation is represented as
a low-dimensional vector (d-dimensional). For each triple, it concatenates and
reshapes the head h and relation r embeddings into a single input [h, r] with
shape dm × dn . This is then passed through a convolutional layer with a filter ω
of size m×n, followed by a fully connected layer with weights W . The final result
is combined with the tail embedding t using a dot product. This architecture can
be considered a multi-class classification model.

Another popular model is ConvKB [33], which is similar to ConvE but concate-
nates the three embeddings h, r, and t into a matrix [h, r, t] of shape d × 3. It is
then passed through a convolutional layer with T filters of size 1 × 3, resulting
in a feature map of size d × T. This is further passed through a fully connected
layer with weights W. This architecture can be considered a binary classification
model (a sketch of this scoring function is given after this list).

• Recurrent Neural Networks (RNNs)

These models apply one or more recurrent layers to analyze the entire path (a
sequence of events/triples) sampled from the training set, instead of treating
each event independently. For example, RSN [18] notes that traditional RNNs
are unsuitable for graphs because each step only takes the relation information
without considering the entity embedding from the previous step. Therefore, it
fails to clearly model the transitions among entity-relation paths. To address
this, they propose RSN (Recurrent Skipping Networks [18]): at each step, if the
input is a relation, a hidden state is updated to reuse the entity embedding. The
output is then dot-multiplied with the target embedding vector.

• Capsule Neural Networks

Capsule networks group neurons into "capsules", where each capsule encodes
specific features of the input, such as representing a particular part of an image.

One advantage of capsule networks is their ability to capture spatial relationships
that are lost in conventional convolution. Each capsule produces feature vectors.
For instance, CapsE [48]: each entity and relation is embedded into vectors,
similar to ConvKB. It concatenates the embeddings h, r, and t into a matrix
of shape d × 3, then applies E convolutional filters of size 1 × 3, resulting in
a d × E matrix. Each i-th row encodes distinct features of h[i], r[i], and t[i].
This matrix is then fed into a capsule layer, where each capsule processes
a column, thus receiving feature-specific information from the input triple. A
second capsule layer is used to produce the final output.

• Graph Attention Networks (GATs)

This category uses the attention mechanism [46], which has achieved notable
success in NLP tasks. For each embedding vector, information from neighboring
entities is aggregated using attention weights. These are then combined and
passed through a fully connected layer with learnable weights to obtain the final
embeddings. For example, GAT [47] applies multi-head attention to each training
triple to generate an embedding vector. This embedding is then transformed via a
weight matrix to produce a higher-dimensional vector that aggregates information
from neighboring nodes in the original triple. An improved version, KBGAT
[32], incorporates the relation embedding into the attention mechanism. These
methods will be discussed in detail in the subsequent sections.

• Other methods

There are also other approaches, such as autoencoder-based techniques like Struc-
tural Deep Network Embedding (SDNE) [49].
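As a concrete illustration of the ConvKB-style scoring mentioned in the list above, the following is a simplified NumPy sketch under our own assumptions (toy sizes and random weights instead of trained ones): each 1 × 3 filter slides over the rows of the d × 3 matrix [h, r, t], producing a d × T feature map that a final dense layer turns into a single score.

import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 3                      # embedding dimension and number of 1x3 filters (toy sizes)

h, r, t = rng.normal(size=(3, d))          # toy entity/relation embeddings
filters = rng.normal(size=(T, 3))          # T convolution filters of shape 1x3
w = rng.normal(size=d * T)                 # weights of the final fully connected layer

def convkb_score(h, r, t):
    triple = np.stack([h, r, t], axis=1)   # d x 3 matrix [h, r, t]
    # Each 1x3 filter is applied row-wise: feature_map[i, k] = triple[i, :] . filters[k]
    feature_map = np.maximum(triple @ filters.T, 0.0)   # d x T feature map with ReLU
    return float(feature_map.reshape(-1) @ w)           # scalar plausibility score

print(convkb_score(h, r, t))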

Matrix Factorization
Matrix factorization-based graph embedding represents the structural characteris-
tics of a graph (e.g., similarity or proximity between vertex pairs) in the form of a
matrix and then factorizes this matrix to obtain vertex embeddings. The input for
this category of methods is typically high-dimensional non-relational features, and the
output is a set of vertex embeddings. There are two matrix factorization-based graph
embedding methods: Graph Laplacian Eigenmaps and Node Proximity Matrix Factor-
ization.

• Graph Laplacian Eigenmaps

This approach preserves graph properties by analyzing similar vertex pairs and
heavily penalizes embeddings that place highly similar nodes far apart in the
embedding space.

• Node Proximity Matrix Factorization

This approach approximates neighboring nodes in a low-dimensional space us-
ing matrix factorization techniques. The objective is to preserve neighborhood
information by minimizing the approximation loss.

Edge Reconstruction
The edge reconstruction method builds edges based on the vertex embeddings so
that the reconstructed graph is as similar as possible to the input graph. This method
either maximizes the edge reconstruction probability or minimizes edge reconstruction
loss. Additionally, the loss can be distance-based or margin-based ranking loss.

• Maximize Edge Reconstruction Probability

In this method, a good vertex embedding maximizes the likelihood of generating
observed edges in the graph. In other words, a good vertex embedding should
allow for reconstructing the original input graph. This is achieved by maximizing
the generative probability of all observed edges using vertex embeddings.

• Minimize Distance-Based Loss

In this approach, embeddings of neighboring nodes should be as close as possible
to the observed neighboring nodes in the original graph. Specifically, vertex prox-
imity can be measured using their embeddings or heuristically based on observed
edges. The difference between these two types of proximity is then minimized to
ensure consistent similarity.

• Minimize Margin-Based Ranking Loss

In this approach, the edges in the input graph represent correlations between
vertex pairs. Some vertices in the graph are often linked with related vertex sets.
This method ensures that embeddings of related nodes are closer together than
unrelated ones by minimizing a margin-based ranking loss.

Graph Kernels
Graph kernel methods represent the entire graph structure as a vector containing
counts of basic substructures extracted from the graph. Subcategories of graph kernel
techniques include: graphlets, subtree patterns, and random walk-based methods.
This approach is designed to embed whole graphs, focusing only on global graph
features. The input is typically homogeneous graphs or graphs with auxiliary informa-
tion.
Generative Models
A generative model is defined by specifying a joint distribution over input features
and class labels, parameterized by a set of variables. There are two subcategories of
generative model-based methods: embedding graphs into latent space and incorporat-
ing semantics for embedding. Generative models can be applied to both node and edge
embeddings. They are commonly used to embed semantic information, with inputs
often being heterogeneous graphs or graphs with auxiliary attributes.

• Embedding Graphs into Latent Semantic Space

In this group, vertices are embedded into a latent semantic space where the
distance between nodes captures the graph structure.

• Incorporating Semantics for Embedding

In this method, each vertex is associated with graph semantics and should be
embedded closer to semantically relevant vertices. These semantic relationships
can be derived from descriptive nodes via a generative model.

Summary: Each graph embedding method has its own strengths and weaknesses,
which have been summarized by Cai, Hongyun [7] and are presented in Table 4.1. The
matrix factorization group learns embeddings by analyzing pairwise global similarities.
The deep learning group, in contrast, achieves promising results and is suitable for
graph embedding because it can learn complex representations from complex graph
structures.

Table 4.1: Comparison of Advantages and Disadvantages of Graph Embedding Techniques

| Method | Subcategory | Advantages | Disadvantages |
|---|---|---|---|
| Matrix Factorization | Graph Laplacian Eigenmaps; Node Proximity Matrix Factorization | Considers the global neighborhood structure | Requires large computation time and space |
| Edge Reconstruction | Maximize edge reconstruction probability; Minimize distance-based loss; Minimize margin-based ranking loss | Relatively efficient training | Only optimizes local information, e.g., first-order neighbors or ranked node pairs |
| Graph Kernels | Based on graphlets; Based on subtree patterns; Based on random walk | Efficient; considers only the desired primitive structures | Substructures are not independent; embedding dimensionality increases exponentially |
| Generative Model | Embedding into latent space | Embedding is interpretable | Difficult to tune the choice of distribution |
| Generative Model | Incorporate semantics | Leverages multiple information sources | Requires a large amount of naturally labeled data |
| Deep Learning | With random walk | Efficient and fast; no need for manual feature extraction | Considers only local path content; difficult to find an optimal sampling strategy |
| Deep Learning | Without random walk | | High computational cost |

Graph embedding methods based on random walk in deep learning have lower com-
putational cost compared to those using full deep learning models. Traditional methods
often treat graphs as grids; however, this does not reflect the true nature of graphs. In
the edge reconstruction group, the objective function is optimized based on observed
edges or by ranking triplets. While this approach is more efficient, the resulting em-
bedding vectors do not account for the global structure of the graph. The graph kernel
methods transform graphs into vectors, enabling graph-level tasks such as graph clas-
sification. Therefore, they are only effective when the desired primitive structures in
a graph can be enumerated. The generative model group naturally integrates infor-
mation from multiple sources into a unified model. Embedding a graph into a latent
semantic space produces interpretable embedding vectors using semantics. However,
assuming the modeling of observations using specific distributions can be difficult to
justify. Moreover, generative approaches require a large amount of training data to
estimate a model that fits the data well. Hence, they may perform poorly on small
graphs or when only a few graphs are available.
Among these methods, deep learning-based graph embedding allows learning com-
plex representations and has shown the most promising results. Graph attention net-
works (GATs), which are based on attention mechanisms, aggregate the information
of an entity using attention weights from its neighbors relative to the central entity.
We believe this research direction is aligned with studies on the relationship between
attention and memory [34], where the distribution of attention determines the weight
or importance of one entity relative to another. Likewise, the embedding vector rep-
resenting an entity is influenced by the attention or importance of its neighboring
embeddings. Therefore, this is the approach we selected among the existing graph
embedding methods.

4.2 Multi-head Attention Mechanism


The attention mechanism was introduced in 2014 by Bahdanau, Dzmitry [2], but it only gained widespread popularity in 2017 through the multi-head attention of the Transformer model by Vaswani, Ashish [46]. The attention mechanism is an effective method to indicate the importance of a word with respect to other words in a sentence, and it has been shown to generalize convolution operations, as reported by Cordonnier, Jean-Baptiste [9]. To understand how multi-head attention is applied to graphs, in this section we detail the mechanism before describing how it is used for link prediction in knowledge graphs.

4.2.1 Attention Mechanism


The input of the attention mechanism consists of two embedding matrices X = {x⃗_1, x⃗_2, ..., x⃗_{N_x}} and Y = {y⃗_1, y⃗_2, ..., y⃗_{N_y}}, where each row i or j of X or Y is an embedding vector x⃗_i ∈ R^{1×D_in}, y⃗_j ∈ R^{1×D_in}.

The attention mechanism transforms input vectors of D_in dimensions into output vectors of D_attention dimensions, representing the importance of each of the N_x elements x with respect to all N_y elements y. Given X ∈ R^{N_x×D_in} and Y ∈ R^{N_y×D_in} as input embedding matrices, and H ∈ R^{N_x×D_attention} as the output embedding matrix, the attention mechanism introduced by Vaswani, Ashish [46] is defined as follows:

H = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V    (4.1)

where Q = XW_Q, K = YW_K, V = YW_V.
The weight matrices W_Q ∈ R^{D_in×D_k}, W_K ∈ R^{D_in×D_k}, and W_V ∈ R^{D_in×D_attention} parameterize the transformation of the input embedding vectors of dimension D_in into vectors of dimension D_k or D_attention. The term QK^T computes the dot product between each vector x and all vectors y, and dividing by \sqrt{d_k} normalizes the result with respect to the key dimension d_k. The result is then passed through the softmax function so that attention scores are comparable across pairs. We can interpret softmax(QK^T / \sqrt{d_k}) as the matrix of attention coefficients, indicating the importance of each y with respect to each x. Finally, this matrix is multiplied with the value matrix V to produce the final output embedding of dimension D_attention.
If X = Y, we are computing the importance of each element with respect to other
elements in the same input matrix, which is referred to as the self-attention mechanism.
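As a concrete illustration of Equation (4.1), the following is a minimal NumPy sketch of scaled dot-product attention; the function and variable names are our own and are not taken from [46].

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention, Eq. (4.1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (N_x, N_y) raw compatibility scores
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax rows = attention coefficients
    return weights @ V                               # (N_x, D_attention) output embeddings

# Usage: with X (N_x, D_in), Y (N_y, D_in) and projections W_Q, W_K, W_V:
# H = scaled_dot_product_attention(X @ W_Q, Y @ W_K, Y @ W_V)
```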

4.2.2 Multi-Head Attention
The multi-head attention mechanism combines multiple attention layers to stabilize the learning process. Similar to the standard attention mechanism above, it transforms the initial N_x embedding vectors of dimension D_in into output embedding vectors of dimension D_multi-head, aggregating information from various other nodes to provide greater stability during training. Multi-head attention concatenates N_head attention output matrices H and then applies a weight matrix to transform the original embedding matrix X ∈ R^{N_x×D_in} into a new embedding matrix X′ ∈ R^{N_x×D_multi-head} using the following formula:

X' = \left(\Big\Vert_{h=1}^{N_{head}} H^{(h)}\right) W_O = \left(\Big\Vert_{h=1}^{N_{head}} \mathrm{Attention}\big(XW_Q^{(h)}, YW_K^{(h)}, YW_V^{(h)}\big)\right) W_O    (4.2)

Here, the weight matrices W_Q^{(h)}, W_K^{(h)} ∈ R^{D_in×D_k} and W_V^{(h)} ∈ R^{D_in×D_attention} correspond to each individual attention head h ∈ [N_head]. The output projection matrix W_O ∈ R^{N_head·D_attention × D_multi-head} parameterizes the transformation of the concatenated output heads into the final output embedding matrix.
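To make Equation (4.2) concrete, here is a minimal NumPy sketch that concatenates the outputs of N_head attention heads and projects them with W_O; the helper name and the list-based weight layout are our own assumptions.

```python
import numpy as np

def multi_head_attention(X, Y, W_Q, W_K, W_V, W_O):
    """Multi-head attention, Eq. (4.2). W_Q, W_K, W_V are lists with one matrix per head;
    W_O has shape (N_head * D_attention, D_multi_head)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, Y @ Wk, Y @ Wv              # per-head projections
        scores = Q @ K.T / np.sqrt(Q.shape[-1])       # scaled dot products
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ V)                     # one attention output H^(h)
    return np.concatenate(heads, axis=-1) @ W_O       # concatenate heads, then project
```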
At this point, we have presented the attention mechanism for computing attention
scores and aggregating embedding information from neighboring vectors. In the next
section, we will describe how this attention mechanism is applied to knowledge graphs.

4.3 Graph Attention Network


With the success of the multi-head attention mechanism in natural language processing,
it has also been studied for applications in image processing models [37]. Consequently,
the multi-head attention mechanism has been explored for replacing convolutional op-
erations in knowledge graph embedding models, such as Graph Convolutional Networks
(GCNs [23]). In this section, we present in detail how the attention mechanism from
4.2.1 is applied to graph embedding via the Graph Attention Network (GAT [47])
method.
The input to the graph attention network model is a set of embedding vectors,
randomly initialized from a normal distribution, representing features of each entity:

Figure 4.7: Knowledge graph and normalized attention coefficients of the entity

E = {e⃗_1, e⃗_2, ..., e⃗_{N_e}}. The objective of the model is to transform this into a new output embedding matrix E'' = {e⃗''_1, e⃗''_2, ..., e⃗''_{N_e}} capable of aggregating embedding information from neighboring entities. Here, E ∈ R^{N_e×D_in} and E'' ∈ R^{N_e×D''} denote the input and output embedding matrices for the entity set, respectively. N_e is the number of entities, and D_in, D'' are the dimensions of the input and output embeddings.
Similar to the multi-head attention mechanism introduced in 4.2.1, the application
of this mechanism to a knowledge graph follows the same logic as the self-attention
mechanism, in which each node attends to all other nodes in the graph. However,
computing attention scores between every pair of nodes in a graph is not meaningful
if no relationship exists between them, and it would incur significant computational
overhead. Therefore, the model applies a technique known as masked attention, in
which all attention scores corresponding to unrelated nodes in the graph are ignored.
These relevant connections are precisely defined as the first-order proximity (Definition
5) of a node in the graph. Thus, in this context, we let X = Y = E (as in 4.2.1), and
the attention coefficient in the masked attention mechanism represents the importance
of a neighboring node j ∈ Ni to the central node i, where Ni is the set of all first-order
neighbors of node i (including i itself).
The application of the multi-head attention mechanism of Section 4.2 to graphs is described as follows:

e_{ij} = f_{\text{mask attention}}(\mathbf{W}\vec{e}_i, \mathbf{W}\vec{e}_j)    (4.3)

where e_{ij} denotes the attention coefficient of an edge (e_i, e_j) with respect to the central entity e_i in the knowledge graph G_know, W is a weight matrix that parameterizes the linear transformation, and f_{mask attention} is the function applying the masked attention mechanism.
In the GAT model, each entity embedding vector e⃗_i undergoes two transformation stages, each applying the multi-head attention mechanism:

\vec{e}_i \xrightarrow{f^{(1)}_{\text{mask attention}}} \vec{e}\,'_i \xrightarrow{f^{(2)}_{\text{mask attention}}} \vec{e}\,''_i    (4.4)

In the first multi-head attention step (f^{(1)}_{mask attention}), the model aggregates information from neighboring entities and concatenates the attention heads to produce the vector e⃗'_i, where e⃗'_i ∈ R^{1×D'}. In the second step (f^{(2)}_{mask attention}), concatenating the attention heads is no longer sensible, so the output is computed as an average instead. The vector e⃗'_i is then treated as the input embedding to be transformed into the final output embedding vector e⃗''_i, with e⃗''_i ∈ R^{1×D''}.
First, similar to the attention mechanism in Equation (4.1), each embedding vector is multiplied by a weight matrix W_1 ∈ R^{D_k×D_in} to parameterize the linear transformation from D_in input dimensions to D_k higher-level feature dimensions:

\vec{h}_i = \mathbf{W}_1 \vec{e}_i    (4.5)

where \vec{e}_i ∈ R^{D_in×1} and \vec{h}_i ∈ R^{D_k×1}.
Next, we concatenate each pair of linearly transformed entity embedding vectors to compute the attention coefficients. The attention coefficient e_{ij} reflects the importance of the edge feature (e_i, e_j) with respect to the central entity e_i, or in other words, the importance of a neighboring entity e_j connected to e_i. The LeakyReLU non-linearity is applied to the raw score. Each attention coefficient e_{ij} is computed using the following equation:

e_{ij} = \mathrm{LeakyReLU}\left(\vec{W}_2^{T} [\vec{h}_i \,\Vert\, \vec{h}_j]\right)    (4.6)

where ·^T denotes the transpose operation and \Vert represents concatenation. This is similar to Equation (4.1); however, instead of a dot product, we use a shared attentional mechanism \vec{W}_2 : R^{D_k} × R^{D_k} → R to compute the attention scores. As described at the beginning of Section 4.3, self-attention is restricted by the masked attention mechanism, which discards all attention scores between structurally unrelated nodes.

To enable meaningful comparison between the attention coefficients of neighboring entities, a softmax function is applied to normalize the coefficients over all neighbors e_j connected to the central entity e_i: α_{ij} = softmax_j(e_{ij}). Combining all of this, the final normalized attention coefficient of each neighbor with respect to the central entity is:

\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{W}_2^{T}[\vec{h}_i \,\Vert\, \vec{h}_j]\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{W}_2^{T}[\vec{h}_i \,\Vert\, \vec{h}_k]\right)\right)}    (4.7)

At this stage, the GAT model operates similarly to GCN [23]: the embedding vectors of neighboring nodes are aggregated, each scaled by its corresponding normalized attention coefficient:

\vec{e}\,'_i = \sigma\left(\sum_{j \in N_i} \alpha_{ij} \vec{h}_j\right)    (4.8)

Similar to the multi-head attention layer, we concatenate N_head attention heads to stabilize the training process in the first step (f^{(1)}_{mask attention} in Equation 4.4) of the model:

\vec{e}\,'_i = \Big\Vert_{h=1}^{N_{head}} \sigma\left(\sum_{j \in N_i} \alpha^{h}_{ij} \mathbf{W}^{h} \vec{h}_j\right)    (4.9)

where σ is any non-linear activation function, and α^{h}_{ij} is the normalized attention coefficient for edge (e_i, e_j) computed by the h-th attention head. Similar to Equation (4.1), W^h is the weight matrix used for the linear transformation of the input embedding vector, with each W^h corresponding to a different attention head. The resulting embedding vector \vec{e}\,'_i ∈ R^{1×D'}, where D' = N_head · D_k, is then used as the input for the next attention layer. However, in the second step (f^{(2)}_{mask attention} in Equation 4.4), the multi-head attention outputs are averaged instead of concatenated:

\vec{e}\,''_i = \sigma\left(\frac{1}{N_{head}} \sum_{h=1}^{N_{head}} \sum_{j \in N_i} \alpha^{h}_{ij} \mathbf{W}^{h} \vec{e}\,'_j\right)    (4.10)
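To make the aggregation in Equations (4.5) to (4.8) concrete, below is a minimal PyTorch sketch of a single GAT attention head over a dense neighborhood mask; the class name, the adjacency-mask input, and the choice of sigmoid for σ are our own assumptions, not the original GAT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATHead(nn.Module):
    """One attention head of a GAT layer (Eqs. 4.5-4.8), illustrative sketch."""
    def __init__(self, d_in, d_k):
        super().__init__()
        self.W1 = nn.Linear(d_in, d_k, bias=False)    # Eq. (4.5): h_i = W1 e_i
        self.W2 = nn.Linear(2 * d_k, 1, bias=False)   # Eq. (4.6): shared attention vector

    def forward(self, E, adj_mask):
        # E: (N_e, d_in) entity embeddings; adj_mask: (N_e, N_e) bool, True where j is in N_i
        # (N_i should include i itself, so every row of the mask has at least one True entry)
        H = self.W1(E)                                           # (N_e, d_k)
        n = H.size(0)
        pairs = torch.cat([H.unsqueeze(1).expand(n, n, -1),      # [h_i || h_j] for all pairs
                           H.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.W2(pairs).squeeze(-1), 0.2)        # raw scores e_ij, Eq. (4.6)
        e = e.masked_fill(~adj_mask, float("-inf"))              # masked attention
        alpha = torch.softmax(e, dim=-1)                         # Eq. (4.7)
        return torch.sigmoid(alpha @ H)                          # Eq. (4.8)
```

Several such heads would then be concatenated as in Equation (4.9), or averaged in the final layer as in Equation (4.10).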

Summary: Up to this point, we have presented how the attention mechanism aggregates a knowledge graph entity embedding vector from its neighboring embeddings and concatenates the attention heads to produce the final embedding. In the next section, we present our complete embedding model based on the KBGAT model proposed by Nathani, Deepak [32].

4.4 KBGAT Model


In a knowledge graph, a single entity cannot fully represent an edge, as the entity
can play multiple roles depending on the type of relation. For example, in Figure 4.7,
Donald Trump serves both as a president and as a husband. To address this issue,
the knowledge graph attention-based embedding model — KBGAT (graph attention
based embeddings [32]) — improves upon the GAT model by incorporating additional
information from relations and neighboring node features into the attention mechanism.
In this section, we will detail the KBGAT model. The structure of KBGAT follows
an encoder-decoder framework, where the encoder is implemented using the Graph
Attention Network (GAT), and the decoder uses the ConvKB model for prediction.
The steps of the KBGAT model are illustrated in 4.11.

entities \xrightarrow{\text{TransE embedding (4.4.1)}} \vec{e} \xrightarrow{\text{KBGAT encoder (4.4.2)}} \vec{e}_{KBGAT} \xrightarrow{\text{ConvKB decoder (4.4.3)}} e_{prob}    (4.11)

First, the embedding vectors of each entity are initialized using the TransE model
to capture the spatial characteristics among nodes and obtain the initial embeddings.
These embeddings are then further trained using an encoder model to capture neigh-
borhood features, resulting in updated embeddings. Finally, these embeddings are
passed through a prediction layer using the ConvKB model. All equations presented
here are based on those in the work of Nathani, Deepak[32].

4.4.1 Embedding Initialization


Similar to the Word2Vec method [30], the model Translating Embeddings for Modeling
Multi-relational Data (TransE [5]) belongs to the group of geometric embedding meth-
ods that transform entities and relations in a knowledge graph into output embedding
vectors such that:
\vec{entity}_{head} + \vec{relation} \approx \vec{entity}_{tail}    (4.12)

Initially, the entity and relation embedding vectors are randomly initialized with dimensionality D_in (uniformly, as in Algorithm 5) and then normalized to unit length.

Figure 4.8: Illustration of embedding vectors in the TransE model (valid and invalid head/tail embeddings relative to the relation vector)

Next, we sample from the training dataset to obtain a batch of valid triples (S_batch). For each such triple, we sample an invalid triple by replacing either the head or the tail entity with a random entity from the entity set, yielding a batch of invalid (corrupted) triples (S′_batch). We then pair each valid triple with an invalid one to form the training batch (T_batch). Finally, we update the embedding vectors to satisfy the condition in Equation 4.12.

Algorithm 5 TransE Embedding Learning Algorithm [5]
Input: Training set S = {(h, r, t)}, entity set E, relation set R, margin γ, embedding dimension D_in
1:  r⃗ ← uniform(−6/√D_in, 6/√D_in) for each relation r ∈ R
2:  r⃗ ← r⃗ / ∥r⃗∥ for each r ∈ R
3:  e⃗ ← uniform(−6/√D_in, 6/√D_in) for each entity e ∈ E
4:  loop
5:      e⃗ ← e⃗ / ∥e⃗∥ for each e ∈ E
6:      S_batch ← sample(S, b)                      // sample a minibatch of size b
7:      T_batch ← ∅
8:      for (h, r, t) ∈ S_batch do
9:          (h′, r, t′) ← sample(S′_(h,r,t))        // sample from the invalid triple set
10:         T_batch ← T_batch ∪ {((h, r, t), (h′, r, t′))}
11:     Update embeddings w.r.t.  Σ_{((h,r,t),(h′,r,t′)) ∈ T_batch} ∇[γ + d(h⃗ + r⃗, t⃗) − d(h⃗′ + r⃗, t⃗′)]_+
Output: A set of embedding vectors with dimension D_in representing entities and relations

The TransE model proposed by Bordes, Antoine [5] is presented in Algorithm 5.
The input of the TransE model is a training dataset whose elements are triples (h, r, t). Here, h, t ∈ E are the head and tail entities, and r ∈ R is the relation. e⃗ and r⃗ are the embedding vectors of entities and relations, and ∥e⃗∥ and ∥r⃗∥ denote their norms, by which the embeddings are normalized to unit length. S and S_batch represent the full training dataset and a sampled batch from it, respectively. T_batch is a batch containing both valid and invalid triples, used to compute the loss function in Equation 4.14.
A valid triple is one directly sampled from the training set (S_batch), while an invalid triple is constructed by corrupting a valid triple: either the head or the tail entity is replaced by a randomly selected entity from the entity set:

S'_{(h,r,t)} = \{(h', r, t) \mid h' \in E\} \cup \{(h, r, t') \mid t' \in E\}    (4.13)

To achieve the goal of learning embedding vectors such that \vec{h} + \vec{r} \approx \vec{t}, the model aims for the tail embedding \vec{t} of a valid triple to lie close to \vec{h} + \vec{r}, while for invalid triples the corrupted embedding \vec{h}' + \vec{r} (or \vec{t}') should lie far from \vec{t} (or \vec{h} + \vec{r}), according to the following margin-based ranking loss function:

L = \sum_{(h,r,t) \in S} \; \sum_{(h',r,t') \in S'_{(h,r,t)}} [d - d' + \gamma]_+    (4.14)

Figure 4.9: TransE embedding model
Here, γ > 0 is the margin, and h′ and t′ are entities sampled as defined in Equation 4.13. ∆ and ∆′ denote the difference vectors of valid and invalid triples, respectively, with d = ∥∆∥₁ = ∥\vec{h} + \vec{r} − \vec{t}∥₁ and d′ = ∥∆′∥₁ = ∥\vec{h}' + \vec{r} − \vec{t}'∥₁, where ∥·∥₁ denotes the L1 norm.
As illustrated in Figure 4.9, if d > d′ (i.e., d − d′ > 0), then \vec{h} + \vec{r} lies closer to the corrupted embedding \vec{t}' than to the valid tail \vec{t}. Since the embedding vectors should satisfy the condition in Equation 4.12, \vec{h} + \vec{r} must be as close as possible to \vec{t}; the closer it is to \vec{t}', the more incorrect the embedding becomes.
Therefore, during training, we want d′ to exceed d by at least the margin γ. Once d′ − d ≥ γ, the hinge term vanishes and no further update is needed for that pair. Hence, the loss in Equation 4.14 keeps only the positive part [d − d′ + γ]₊, because a non-positive value already indicates that the condition of Equation 4.12 is sufficiently satisfied.
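A minimal PyTorch sketch of the training objective in Algorithm 5 and Equation (4.14) could look as follows; the class and method names are ours, and batching and negative sampling are left to the caller, following the textual description rather than the original implementation of Bordes, Antoine [5].

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """TransE sketch: score d(h, r, t) = ||h + r - t||_1 and the hinge loss of Eq. (4.14)."""
    def __init__(self, n_entities, n_relations, dim, margin=1.0):
        super().__init__()
        bound = 6.0 / dim ** 0.5                      # uniform(-6/sqrt(D_in), 6/sqrt(D_in))
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        nn.init.uniform_(self.ent.weight, -bound, bound)
        nn.init.uniform_(self.rel.weight, -bound, bound)
        with torch.no_grad():                         # normalize relation embeddings once
            self.rel.weight.div_(self.rel.weight.norm(dim=1, keepdim=True))
        self.margin = margin

    def distance(self, h, r, t):
        return (self.ent(h) + self.rel(r) - self.ent(t)).norm(p=1, dim=-1)

    def loss(self, pos, neg):
        # pos, neg: (batch, 3) index tensors of valid and corrupted (h, r, t) triples
        with torch.no_grad():                         # normalize entity embeddings each step
            self.ent.weight.div_(self.ent.weight.norm(dim=1, keepdim=True))
        d_pos = self.distance(pos[:, 0], pos[:, 1], pos[:, 2])
        d_neg = self.distance(neg[:, 0], neg[:, 1], neg[:, 2])
        return torch.relu(self.margin + d_pos - d_neg).mean()   # [d - d' + gamma]_+
```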

4.4.2 Encoder Model
After obtaining the embedding vectors that capture spatial features of the knowledge
graph, these embeddings are passed through the next embedding layer to further ag-
gregate the neighborhood information of each entity.

Figure 4.10: Illustration of the encoder layers in the KBGAT model

The model transforms the entity embedding matrix E = {e⃗_1, e⃗_2, ..., e⃗_{N_e}} into E'' = {e⃗''_1, e⃗''_2, ..., e⃗''_{N_e}}, with E ∈ R^{N_e×D_in} and E'' ∈ R^{N_e×D''}. Simultaneously, it transforms the relation embedding matrix R = {r⃗_1, r⃗_2, ..., r⃗_{N_r}} into R'' = {r⃗''_1, r⃗''_2, ..., r⃗''_{N_r}}, with R ∈ R^{N_r×P_in} and R'' ∈ R^{N_r×P''}.
Similar to the GAT model described in Section 4.3, the model transforms the entity
embedding vectors from Din dimensions to D′′ dimensions by aggregating neighborhood
information through attention coefficients. Pin and P ′′ are the input and output di-
mensions of the relation embedding vectors, respectively. Ne and Nr are the sizes of
the entity and relation sets in Gknow , respectively.
The KBGAT model concatenates the entity and relation embedding vectors according to the following structure:

\vec{t}_{ijk} = \mathbf{W}_1 [\vec{e}_i \,\Vert\, \vec{e}_j \,\Vert\, \vec{r}_k]    (4.15)

Here, \vec{t}_{ijk} is the embedding vector representing the triple t^k_{ij} = (e_i, r_k, e_j), where e_j and r_k are the neighboring entity and the relation connecting the source node e_i to the node e_j. W_1 ∈ R^{D_k×(2D_in+P_in)} is a weight matrix that performs a linear transformation of the concatenated input vectors into a new vector of dimensionality D_k. These weight matrices are either randomly initialized using a normal distribution or pre-trained using the TransE model [5].
Similar to Equation 4.7 in the GAT model, we need to compute the attention coefficient of each edge with respect to a given node. The softmax function is then applied to normalize these coefficients:

\alpha_{ijk} = \mathrm{softmax}_{jk}\big(\mathrm{LeakyReLU}(\mathbf{W}_2 \vec{t}_{ijk})\big) = \frac{\exp\big(\mathrm{LeakyReLU}(\mathbf{W}_2 \vec{t}_{ijk})\big)}{\sum_{n \in N_i} \sum_{r \in R_{in}} \exp\big(\mathrm{LeakyReLU}(\mathbf{W}_2 \vec{t}_{inr})\big)}    (4.16)

where N_i denotes the set of neighbors of the central node e_i within n_hop hops, and R_in is the set of all relations that appear on the paths connecting the source entity e_i to a neighboring entity e_n ∈ N_i. Similar to Equation 4.8, the triple embedding vectors \vec{t}_{ijk} are scaled by their corresponding normalized attention coefficients:

\vec{e}\,'_i = \sigma\left(\sum_{j \in N_i} \sum_{k \in R_{ij}} \alpha_{ijk} \vec{t}_{ijk}\right)    (4.17)

Similar to the multi-head attention mechanism in Equation 4.9, we concatenate N_head attention heads to stabilize the learning process:

\vec{e}\,'_i = \Big\Vert_{h=1}^{N_{head}} \sigma\left(\sum_{j \in N_i} \sum_{k \in R_{ij}} \alpha^{(h)}_{ijk} \vec{t}^{(h)}_{ijk}\right)    (4.18)

Likewise, for the relation embeddings, a weight matrix W_R is used to perform a linear transformation from the original relation embedding dimension P to a new dimension P':

R' = R\,\mathbf{W}^{R}, \quad \text{where } \mathbf{W}^{R} \in R^{P \times P'}    (4.19)
At this stage, we have obtained two matrices, H' ∈ R^{N_e×D'} and R' ∈ R^{N_r×P'}, which are the updated entity and relation embedding matrices with their new dimensions. The model then proceeds through the final attention layer, taking the newly updated entity and relation embeddings as input, as in Equation 4.10. However, at this final prediction layer concatenating the attention heads is no longer sensible, so instead of concatenating, the model averages the outputs and then applies a final non-linear activation function:

\vec{e}\,''_i = \sigma\left(\frac{1}{N_{head}} \sum_{h=1}^{N_{head}} \sum_{j \in N_i} \sum_{k \in R_{ij}} \alpha'^{(h)}_{ijk} \vec{t}\,'^{(h)}_{ijk}\right)    (4.20)

where \alpha'^{(h)}_{ijk} and \vec{t}\,'^{(h)}_{ijk} denote the normalized attention coefficients and the triple embedding vectors for (e_i, r_k, e_j) in attention head (h), respectively.
Up to this point, the KBGAT model functions similarly to the GAT model in Section 4.3, but it additionally incorporates relation embedding information and neighbor nodes up to n_hop hops. This results in the final entity embedding matrix E'' ∈ R^{N_e×D''} and the final relation embedding matrix R'' ∈ R^{N_r×P''}.
However, after the embedding learning process, the final entity embedding matrix E'' may lose the initial embedding information due to the vanishing gradient problem. To address this, the model employs residual learning: the initial embedding matrix E is projected through a weight matrix W^E ∈ R^{D_in×D''} and added directly to the final embedding, thus preserving the initial embedding information during training:

H = \mathbf{W}^{E} E + E''    (4.21)

Finally, the training datasets are sampled to generate valid triples and invalid triples, similarly to the TransE model described above, in order to learn the embedding vectors. The distance between embedding vectors is computed using the L1 norm as follows:

d_{t_{ij}} = \big\Vert \vec{h}_i + \vec{g}_k - \vec{h}_j \big\Vert_1
Similarly, we train the model using a margin-based loss function:

L(\Omega) = \sum_{t_{ij} \in S} \sum_{t'_{ij} \in S'} \max\{d_{t'_{ij}} - d_{t_{ij}} + \gamma,\; 0\}    (4.22)

where γ > 0 is the margin parameter, S is the set of valid triples, and S' is the set of invalid triples, obtained by head entity replacement and tail entity replacement:

S' = \underbrace{\{t^k_{i'j} \mid e'_i \in E \setminus e_i\}}_{\text{head entity replacement}} \cup \underbrace{\{t^k_{ij'} \mid e'_j \in E \setminus e_j\}}_{\text{tail entity replacement}}    (4.23)
The output of the KBGAT model is the set of entity and relation embedding vectors,
which are subsequently fed into the ConvKB model for final prediction.
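To illustrate how one triple-level attention head of the encoder (Equations 4.15 to 4.17) can be computed, here is a minimal PyTorch sketch over a list of triples; the class name and the simplified one-hop, single-head aggregation are our own assumptions and not the actual KBGAT implementation of Nathani, Deepak [32].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleAttentionHead(nn.Module):
    """One encoder attention head over triples (Eqs. 4.15-4.17), illustrative sketch."""
    def __init__(self, d_ent, d_rel, d_k):
        super().__init__()
        self.W1 = nn.Linear(2 * d_ent + d_rel, d_k, bias=False)   # Eq. (4.15)
        self.W2 = nn.Linear(d_k, 1, bias=False)                   # attention scoring, Eq. (4.16)

    def forward(self, E, R, triples):
        # E: (N_e, d_ent) entity embeddings; R: (N_r, d_rel) relation embeddings
        # triples: (T, 3) long tensor of (head i, relation k, tail j) indices
        i, k, j = triples[:, 0], triples[:, 1], triples[:, 2]
        t_ijk = self.W1(torch.cat([E[i], E[j], R[k]], dim=-1))     # (T, d_k), Eq. (4.15)
        b = F.leaky_relu(self.W2(t_ijk)).squeeze(-1)               # raw attention scores
        eb = (b - b.max()).exp()                                   # stable exponentials
        denom = torch.zeros(E.size(0), dtype=eb.dtype).index_add_(0, i, eb)
        alpha = eb / denom[i]                                      # softmax over triples sharing head i
        out = torch.zeros(E.size(0), t_ijk.size(1), dtype=t_ijk.dtype)
        out.index_add_(0, i, alpha.unsqueeze(-1) * t_ijk)          # Eq. (4.17) aggregation
        return torch.sigmoid(out)
```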

4.4.3 ConvKB Prediction Model

Figure 4.11: Illustration of the decoder layers of the ConvKB model with 3 filters

After mapping entities and relations into a low-dimensional space, the model employs ConvKB [33] to analyze the global features of a triple t_{ijk} across each dimension, thereby generalizing transformation patterns through convolutional layers. The scoring function with the learned feature mappings is defined as follows:

f(t_{ijk}) = \left(\Big\Vert_{m=1}^{\Omega} \mathrm{ReLU}\big([\vec{e}_i, \vec{r}_k, \vec{e}_j] * \omega^{m}\big)\right) \cdot \mathbf{W}    (4.24)

where ω^m denotes the m-th convolutional filter, Ω is a hyperparameter specifying the number of convolutional filters, * denotes the convolution operation, and W ∈ R^{Ωk×1} is the linear transformation matrix used to compute the final score of the triple.
The model is trained using a soft-margin loss function as follows:

L = \sum_{t^k_{ij} \in \{S \cup S'\}} \log\big(1 + \exp\big(l_{t^k_{ij}} \cdot f(t^k_{ij})\big)\big) + \frac{\lambda}{2}\Vert \mathbf{W} \Vert_2^2    (4.25)

where l_{t^k_{ij}} = 1 for t^k_{ij} ∈ S and l_{t^k_{ij}} = -1 for t^k_{ij} ∈ S'.

The final output of the ConvKB model is the ranking score corresponding to each
prediction.
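A minimal sketch of the ConvKB scoring function in Equation (4.24), assuming Ω convolutional filters of size 1×3 slid over the stacked [e_i, r_k, e_j] matrix, might look as follows; the module and parameter names are ours, not the original ConvKB [33] code.

```python
import torch
import torch.nn as nn

class ConvKBScorer(nn.Module):
    """ConvKB decoder score, Eq. (4.24), illustrative sketch with n_filters filters of size 1x3."""
    def __init__(self, dim, n_filters):
        super().__init__()
        self.conv = nn.Conv2d(1, n_filters, kernel_size=(1, 3))   # slides over each embedding dimension
        self.fc = nn.Linear(dim * n_filters, 1, bias=False)       # the matrix W in Eq. (4.24)

    def forward(self, e_i, r_k, e_j):
        # e_i, r_k, e_j: (batch, dim) embeddings of head, relation, tail
        x = torch.stack([e_i, r_k, e_j], dim=-1).unsqueeze(1)      # (batch, 1, dim, 3)
        x = torch.relu(self.conv(x))                               # (batch, n_filters, dim, 1)
        return self.fc(x.flatten(1))                               # (batch, 1) triple score
```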

Chapter 5. EXPERIMENTS

[Chart: number of entities and relations in each dataset (FB15k, FB15k-237, WN18, WN18RR, YAGO3-10); the exact figures are listed in Table 5.1.]

In this section, we describe the datasets used for our empirical evaluation, along with a comparison against notable existing methods, reported in Tables 5.2 and 5.3. Additionally, we evaluate our two proposed approaches for injecting new knowledge into the knowledge graph: we treat the test set as a batch of new knowledge to be added and use the validation set to re-evaluate the effectiveness of our method. Detailed results for these strategies are presented in Table 5.4 and Table 5.5.

| Dataset | Entities | Relations | Training edges | Validation edges | Test edges |
|---|---|---|---|---|---|
| FB15k | 14,951 | 1,345 | 483,142 | 50,000 | 59,071 |
| FB15k-237 | 14,541 | 237 | 272,115 | 17,535 | 20,466 |
| WN18 | 40,943 | 18 | 141,442 | 5,000 | 5,000 |
| WN18RR | 40,559 | 11 | 86,835 | 3,034 | 3,134 |
| YAGO3-10 | 123,182 | 37 | 1,079,040 | 5,000 | 5,000 |

Table 5.1: Dataset Information

5.1 Training Datasets

[Chart: total number of triples per dataset (training + validation + test): FB15k 592,213; FB15k-237 310,116; WN18 151,442; WN18RR 93,003; YAGO3-10 1,089,040.]

In our experiments, we evaluate our approach on four widely used benchmark datasets:
FB15k, FB15k-237 ([43]), WN18, and WN18RR ([12]). Each dataset is divided into
three subsets: training, validation, and test sets. Detailed statistics for these datasets
are presented in Table 5.1.
Each dataset consists of a collection of triples in the form ⟨head, relation, tail⟩.
FB15k and WN18 are derived from the larger knowledge bases FreeBase and WordNet,
respectively. However, they contain a large number of inverse relations, which allow
most triples to be easily inferred. To address this issue and to better reflect real-world
link prediction scenarios, FB15k-237 and WN18RR were constructed by removing such
inverse relations.
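Each benchmark is distributed as plain-text files of tab-separated triples (commonly train.txt, valid.txt, and test.txt). A minimal loader might look like the sketch below; the file names and return layout are assumptions about the usual distribution format, not a fixed specification.

```python
def load_triples(path):
    """Read tab-separated <head, relation, tail> triples from a text file."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            head, relation, tail = line.rstrip("\n").split("\t")
            triples.append((head, relation, tail))
    return triples

# Usage (assumed file layout):
# train = load_triples("FB15k-237/train.txt")
# valid = load_triples("FB15k-237/valid.txt")
# test  = load_triples("FB15k-237/test.txt")
```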

5.1.1 FB15k Dataset


This dataset was constructed by the research group of A. Bordes and N. Usunier
[5] by extracting data from the Wikilinks database1 . The Wikilinks database contains
hyperlinks to Wikipedia articles, comprising over 40 million mentions of approximately
3 million entities. The authors extracted all facts related to a given entity that is
mentioned at least 100 times across different documents, along with all facts associated
with those entities (including child entities mentioned in the corresponding Wikipedia
articles), excluding attributes such as dates, proper nouns, etc.
N-ary relationships were normalized by converting multi-way edges into sets of binary relations, enumerating all edges and relations between each pair of nodes.
1 https://code.google.com/archive/p/wiki-links/

5.1.2 FB15k-237 Dataset
This dataset is a subset of FB15k, constructed by Toutanova and Chen [43], motivated
by the observation that FB15k suffers from test leakage, where models are exposed to
test facts during training. This issue arises due to the presence of duplicate or inverse
relations in FB15k. FB15k-237 was created to provide a more challenging benchmark.
The authors selected facts related to the 401 most frequent relations and eliminated all
redundant or inverse relations. Additionally, they ensured that no entities connected
in the training set were directly linked in the test and validation sets.

5.1.3 WN18 Dataset

[Chart: frequency distribution of the relation types in WN18 (e.g., hypernym, hyponym, member meronym, part of, synset domain topic of), with the most frequent relations occurring roughly 30,000 to 35,000 times.]

This dataset was introduced by the authors of TransE [5], and is extracted from
WordNet2 , a lexical knowledge graph ontology designed to provide a dictionary/the-
saurus to support NLP tasks and automated text analysis. In WordNet, entities corre-
spond to synsets (i.e., word senses), and relations represent lexical connections among
them (e.g., “hypernym”).
To construct WN18, the authors used WordNet as a starting point and iteratively
filtered out entities and relations that were infrequently mentioned.
2 https://wordnet.princeton.edu/

5.1.4 WN18RR Dataset
This dataset is a subset of WN18, constructed by Dettmers et al. [11], who also addressed the issue of test leakage in WN18. To tackle this issue, they created the WN18RR dataset, which is significantly more challenging, by applying a methodology similar to that used for FB15k-237 [43].

5.2 Evaluation Metrics


In this section, we describe the evaluation metrics, experimental environment, and datasets used to assess the proposed method. These metrics are widely adopted for evaluating link prediction models on knowledge graphs. We compare our approach against other state-of-the-art methods reported in [38].

5.2.0.1 Hits@K (H@K)

This metric measures the proportion of correct predictions whose rank is less than or
equal to the threshold K:

H@K = \frac{|\{q \in Q : \mathrm{rank}(q) \le K\}|}{|Q|}

5.2.0.2 Mean Rank (MR)

This metric calculates the average rank of the correct entity in the prediction. A lower
value indicates better model performance:

MR = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{rank}(q)

Here, | Q | denotes the total number of queries, which equals the size of the test or
validation set. During evaluation, we perform both head and tail entity predictions for
each triple. For example, we predict both ⟨?, relation, tail⟩ and ⟨head, relation, ?⟩.
The variable q denotes a query, and rank(q) indicates the rank position of the correct
entity. The final MR score is the average rank over all head and tail predictions.
Clearly, this metric ranges from 1 to the total number of entities, since each query is ranked against all candidate entities. However, the metric is highly sensitive to outliers, as a few difficult queries can yield extremely large rank values for the correct entities. To address this issue, our method, as well as other recent works, also adopts the Mean Reciprocal Rank (MRR) metric.

5.2.0.3 Mean Reciprocal Rank (MRR)

The Mean Reciprocal Rank (MRR) is the average of the reciprocals of the ranks obtained for the correct predictions. Higher values indicate better model performance. Since this metric takes the reciprocal of each rank, it mitigates the sensitivity to outliers encountered with the Mean Rank (MR) metric:

MRR = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}(q)}
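The three metrics above can be computed directly from the list of ranks produced for all head and tail prediction queries; a minimal sketch, with names of our own choosing:

```python
def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR and Hits@K from a list of ranks of the correct entity."""
    n = len(ranks)
    mr = sum(ranks) / n
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(1 for r in ranks if r <= k) / n for k in ks}
    return mr, mrr, hits

# Usage: ranks collects one rank per query, for both <?, relation, tail>
# and <head, relation, ?> predictions over the test or validation set.
```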

5.3 Training Methodology

5.3.1 Training with the KBGAT Model


We first initialize the entity and relation embeddings using the TransE model [5]. To
construct negative (invalid) triples, we randomly replace either the head or tail entity
in a valid triple with another entity sampled from the entity set.
The training process is divided into two phases. The first phase serves as an encoder,
transforming the initialized embeddings into new embeddings that aggregate neighbor-
hood information using the KBGAT model. This produces updated embeddings for
both entities and relations. The second phase serves as a decoder, performing link
prediction by incorporating n-hop neighborhood information. This enables the model
to better aggregate context from neighboring entities. Furthermore, we incorporate
auxiliary relations to enrich the local structure in sparse graphs.
We use the Adam optimizer with a learning rate of µ = 0.001. The final embedding
dimension for both entities and relations is set to 200. The optimal hyperparameters
are determined via grid search, as detailed in Appendix A.
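The grid search can be sketched as a simple loop that keeps the configuration with the best Hits@10 on the validation set; the candidate values below mirror Appendix A, while the train_and_evaluate callback is a placeholder for the actual training and validation run.

```python
import itertools

def grid_search(train_and_evaluate):
    """Pick the hyperparameter combination with the highest validation Hits@10."""
    grid = {
        "learning_rate": [1e-3],
        "weight_decay": [1e-5, 5e-6],
        "margin": [1, 5],
        "n_heads": [2],
        "embedding_dim": [200],
    }
    best_config, best_hits10 = None, -1.0
    for values in itertools.product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        hits10 = train_and_evaluate(config)        # user-supplied training + validation run
        if hits10 > best_hits10:
            best_config, best_hits10 = config, hits10
    return best_config, best_hits10
```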

5.4 Experimental Results


As previously mentioned, our rule-based model can be fully executed on a standard laptop. In our experiment, the machine configuration was as follows: T480, 8th-generation Core i5, 16GB RAM, 4 cores and 8 threads. The source code was implemented in Python 3.6, using only built-in Python functions without any third-party libraries. The experiments were conducted on four widely used datasets: FB15k, FB15k-237, WN18, and WN18RR. Detailed information about these datasets is provided in Section 5.1.
As described in Algorithm 2, the AnyBURL algorithm learns rules generated within a user-configurable time interval. Here, we set the training time to 1000 seconds (approximately 17 minutes), with saturation (SAT) set to 0.85, confidence threshold Q set to 0.05, and sample size S configured as 1/10 of the training set. With this setup, our Python version of the model produced results comparable to the Java version developed by Meilicke, Christian et al. [27], which was configured similarly but trained for only 100 seconds. The difference in training time is primarily due to performance differences between Python and Java. We chose Python because it is the primary language used in many recent artificial intelligence models and because it makes performance comparison and evaluation against other deep learning methods convenient, since most of them are also implemented in Python.

| Model | FB15k H@1 | FB15k H@10 | FB15k MR | FB15k MRR | FB15k-237 H@1 | FB15k-237 H@10 | FB15k-237 MR | FB15k-237 MRR |
|---|---|---|---|---|---|---|---|---|
| ComplEx | 81.56 | 90.53 | 34 | 0.848 | 25.72 | 52.97 | 202 | 0.349 |
| TuckER | 72.89 | 88.88 | 39 | 0.788 | 25.90 | 53.61 | 162 | 0.352 |
| TransE | 49.36 | 84.73 | 45 | 0.628 | 21.72 | 49.65 | 209 | 0.31 |
| RotatE | 73.93 | 88.10 | 42 | 0.791 | 23.83 | 53.06 | 178 | 0.336 |
| ConvKB | 59.46 | 84.94 | 51 | 0.688 | 21.90 | 47.62 | 281 | 0.305 |
| KBGAT | 70.08 | 91.64 | 38 | 0.784 | 36.06 | 58.32 | 211 | 0.4353 |
| AnyBURL | 79.13 | 82.30 | 285 | 0.824 | 20.85 | 42.40 | 490 | 0.311 |

Table 5.2: Experimental results on the FB15k and FB15k-237 datasets

| Model | WN18 H@1 | WN18 H@10 | WN18 MR | WN18 MRR | WN18RR H@1 | WN18RR H@10 | WN18RR MR | WN18RR MRR |
|---|---|---|---|---|---|---|---|---|
| ComplEx | 94.53 | 95.50 | 3623 | 0.349 | 42.55 | 52.12 | 4909 | 0.458 |
| TuckER | 94.64 | 95.80 | 510 | 0.951 | 42.95 | 51.40 | 6239 | 0.459 |
| TransE | 40.56 | 94.87 | 279 | 0.646 | 2.79 | 94.87 | 279 | 0.646 |
| RotatE | 94.30 | 96.02 | 274 | 0.949 | 42.60 | 57.35 | 3318 | 0.475 |
| ConvKB | 93.89 | 95.68 | 413 | 0.945 | 38.99 | 50.75 | 4944 | 0.427 |
| KBGAT | | | | | 35.12 | 57.01 | 1974 | 0.4301 |
| AnyBURL | 93.96 | 95.07 | 230 | 0.955 | 44.22 | 54.40 | 2533 | 0.497 |

Table 5.3: Experimental results on the WN18 and WN18RR datasets

Table 5.2 and Table 5.3 describe our experimental results with the H@K metrics
along with the experimental results of other methods mentioned in the survey [38].

| Dataset | Metric | AnyBURL | Batch edge AnyBURL | Edge AnyBURL |
|---|---|---|---|---|
| FB15k | hit@10 | 82.22 | 82.48 | 83.08 |
| FB15k | MR | 285 | 250 | 220 |
| FB15k | MRR | 0.824 | 0.853 | 0.866 |
| FB15k-237 | hit@10 | 42.40 | 43.40 | 43.51 |
| FB15k-237 | MR | 490 | 472 | 441 |
| FB15k-237 | MRR | 0.311 | 0.353 | 0.377 |
| WN18 | hit@10 | 95.07 | 95.09 | 95.19 |
| WN18 | MR | 230 | 229 | 228 |
| WN18 | MRR | 0.955 | 0.955 | 0.956 |
| WN18RR | hit@10 | 54.40 | 54.63 | 54.70 |
| WN18RR | MR | 2533 | 2346 | 2215 |
| WN18RR | MRR | 0.497 | 0.553 | 0.581 |

Table 5.4: Accuracy results of the two new knowledge addition strategies

Table 5.4 reports the accuracy of the two strategies for adding new knowledge into the graph. Table 5.5 evaluates the total number of generated rules, as well as the number of rules with confidence ≥ 50% and ≥ 80%.

| Dataset | | Batch edge AnyBURL | Edge AnyBURL |
|---|---|---|---|
| FB15k | num rules | 1011 | 1367 |
| FB15k | confidence ≥ 50% | 416 (41.14%) | 1185 (86.69%) |
| FB15k | confidence ≥ 80% | 284 (28.09%) | 481 (35.18%) |
| FB15k-237 | num rules | 1120 | 756 |
| FB15k-237 | confidence ≥ 50% | 244 (21.79%) | 660 (87.30%) |
| FB15k-237 | confidence ≥ 80% | 95 (8.48%) | 162 (21.43%) |
| WN18 | num rules | 533 | 260 |
| WN18 | confidence ≥ 50% | 270 (38.46%) | 252 (96.92%) |
| WN18 | confidence ≥ 80% | 240 (34.19%) | 225 (86.54%) |
| WN18RR | num rules | 439 | 106 |
| WN18RR | confidence ≥ 50% | 110 (25.05%) | 102 (96.22%) |
| WN18RR | confidence ≥ 80% | 83 (18.91%) | 85 (81.19%) |

Table 5.5: Evaluation results on the number of rules of the two new knowledge addition
strategies

Chapter 6. CONCLUSION

In this section, we presented the results achieved by the proposed model, along with
detailed analyses on different datasets to clarify both the strengths and the remaining
limitations. From this, we identified potential research directions for improving the
model in the future.
Although our rule-based method demonstrates performance comparable to mod-
ern deep learning models (state-of-the-art), and clearly outperforms them in terms
of training time—only about 17 minutes compared to several hours for deep learning
models—this does not imply that deep learning models are not worth studying. On
the contrary, through performance analysis across different datasets, we observed that
for datasets with diverse relations like FreeBase, the KBGAT model using attention
mechanisms yielded significantly better results than on datasets like WordNet, which
contain fewer relation types. This highlights the potential of leveraging deep learning
mechanisms tailored to the specific characteristics of each dataset.
This shows that the attention mechanism, by incorporating relational embedding information, helps to better capture graph structure in datasets with a wide variety of relations. For datasets that contain many similar and inverse triples, such as FB15k and WN18, the rule-based AnyBURL model achieved superior results, whereas deep learning methods only achieved average performance compared to other methods. However, for datasets from which such similar or inverse information has been removed, such as FB15k-237 and WN18RR, the rule-based method is less effective, since it relies on previously observed paths or links. In contrast, deep learning models represent relations and entities in a vector space to learn their interactions, allowing them to perform relatively better on FB15k-237 and WN18RR than on FB15k and WN18.
One of the main advantages of the rule-based approach is that the generated rules
are interpretable during training and require significantly less training time compared
to other methods. However, after the training phase, the rule-based method must

iterate through all learned rules to make predictions. This is an area where deep
learning models show better performance, as models like KBGAT can use the learned
weights and computational layers to transform inputs into probabilistic predictions
much faster. The drawback of deep learning approaches is their lack of interpretability
during training, as well as the high computational cost. Regarding our two proposed
algorithms for adding new knowledge to the graph, we found them to significantly
outperform deep learning methods.
The graph embedding process represents the features of entities, relations, or the knowledge graph as a whole as lower-dimensional vectors (Section 4.1). In practice, however, entities and relations play independent roles in a piece of knowledge, so they could reasonably be embedded into vectors of different dimensionalities. The ratio between the entity and relation embedding dimensions is therefore an important issue that requires further investigation.
Additionally, in the real world, the temporal factor is a critical piece of information
that can completely alter the meaning of a piece of knowledge. Therefore, integrating
temporal information into the attention mechanism is one of the research directions we
aim to pursue to ensure the semantic accuracy of the knowledge graph.
For the rule-based method AnyBURL, the reinforcement learning branch has re-
cently seen significant progress, and the authors Meilicke, Christian and Chekol [26]
have recently proposed a study to optimize the AnyBURL method using reinforcement
learning. We also intend to explore this direction and aim to report our findings in the
near future.

REFERENCES

[1] Franz Baader and Tobias Nipkow. Term rewriting and all that. Cambridge uni-
versity press, 1999.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine trans-
lation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473,
2014.

[3] Shumeet Baluja, Rohan Seth, Dharshi Sivakumar, Yushi Jing, Jay Yagnik,
Shankar Kumar, Deepak Ravichandran, and Mohamed Aly. Video suggestion
and discovery for youtube: taking random walks through the view graph. In Pro-
ceedings of the 17th international conference on World Wide Web, pages 895–904,
2008.

[4] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan


Günnemann. Netgan: Generating graphs via random walks. arXiv preprint
arXiv:1803.00816, 2018.

[5] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Ok-
sana Yakhnenko. Translating embeddings for modeling multi-relational data. In
Advances in neural information processing systems, pages 2787–2795, 2013.

[6] Samuel R Buss. An introduction to proof theory. Handbook of proof theory, 137:1–
78, 1998.

[7] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. A comprehen-
sive survey of graph embedding: Problems, techniques, and applications. IEEE
Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.

[8] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representa-
tions with global structural information. In Proceedings of the 24th ACM interna-

tional on conference on information and knowledge management, pages 891–900,
2015.

[9] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship
between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584,
2019.

[10] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. Multi-head atten-
tion: Collaborate instead of concatenate. arXiv preprint arXiv:2006.16362, 2020.

[11] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Con-
volutional 2d knowledge graph embeddings. arXiv preprint arXiv:1707.01476,
2017.

[12] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Con-
volutional 2d knowledge graph embeddings. In Thirty-Second AAAI Conference
on Artificial Intelligence, 2018.

[13] Luis Galárraga, Christina Teflioudi, Katja Hose, and Fabian M. Suchanek. Amie:
association rule mining under incomplete evidence in ontological knowledge bases.
In WWW ’13, 2013.

[14] Luis Galárraga, Christina Teflioudi, Katja Hose, and Fabian M Suchanek. Fast rule
mining in ontological knowledge bases with amie. The VLDB Journal, 24(6):707–
730, 2015.

[15] Google. Introducing the Knowledge Graph: things, not strings, 2020 (accessed on
August 27, 2020).

[16] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and
performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.

[17] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for net-
works. In Proceedings of the 22nd ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 855–864, 2016.

[18] Lingbing Guo, Zequn Sun, and Wei Hu. Learning to exploit long-term relational
dependencies in knowledge graphs. arXiv preprint arXiv:1905.04914, 2019.

[19] Wilfrid Hodges et al. A shorter model theory. Cambridge university press, 1997.

[20] John J Hopfield. Hopfield network. Scholarpedia, 2(5):1977, 2007.

[21] Sergey Ivanov and Evgeny Burnaev. Anonymous walk embeddings. arXiv preprint
arXiv:1805.11921, 2018.

[22] Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and Philip S Yu. A
survey on knowledge graphs: Representation, acquisition and applications. arXiv
preprint arXiv:2002.00388, 2020.

[23] Thomas N Kipf and Max Welling. Semi-supervised classification with graph con-
volutional networks. arXiv preprint arXiv:1609.02907, 2016.

[24] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recog-
nition with gradient-based learning. In Shape, contour and grouping in computer
vision, pages 319–345. Springer, 1999.

[25] Johann A. Makowsky. Why horn formulas matter in computer science: Initial
structures and generic examples. Journal of Computer and System Sciences, 34(2-
3):266–292, 1987.

[26] Christian Meilicke, Melisachew Wudage Chekol, Manuel Fink, and Heiner Stuck-
enschmidt. Reinforced anytime bottom up rule learning for knowledge graph com-
pletion. arXiv preprint arXiv:2004.04412, 2020.

[27] Christian Meilicke, Melisachew Wudage Chekol, Daniel Ruffinelli, and Heiner
Stuckenschmidt. Anytime Bottom-Up Rule Learning for Knowledge Graph Com-
pletion, 2019.

[28] Christian Meilicke, Melisachew Wudage Chekol, Daniel Ruffinelli, and Heiner
Stuckenschmidt. Anytime bottom-up rule learning for knowledge graph comple-
tion. In IJCAI, pages 3137–3143, 2019.

[29] Christian Meilicke, Manuel Fink, Yanjie Wang, Daniel Ruffinelli, Rainer Gemulla,
and Heiner Stuckenschmidt. Fine-grained evaluation of rule-and embedding-based
systems for knowledge graph completion. In International Semantic Web Confer-
ence, pages 3–20. Springer, 2018.

[30] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation
of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[31] Seyedeh Fatemeh Mousavi, Mehran Safayani, Abdolreza Mirzaei, and Hoda Ba-
honar. Hierarchical graph embedding in vector space by graph pyramid. Pattern
Recognition, 61:245–254, 2017.

[32] Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. Learning
attention-based embeddings for relation prediction in knowledge graphs. arXiv
preprint arXiv:1906.01195, 2019.

[33] Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Phung. A novel
embedding model for knowledge base completion based on convolutional neural
network. arXiv preprint arXiv:1712.02121, 2017.

[34] Ministry of Health of Vietnam. Memory and Attention, 2020. (Accessed on August
26, 2020).

[35] Stefano Ortona, Venkata Vamsikrishna Meduri, and Paolo Papotti. Robust dis-
covery of positive and negative rules in knowledge bases. In 2018 IEEE 34th
International Conference on Data Engineering (ICDE), pages 1168–1179. IEEE,
2018.

[36] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of
social representations. In Proceedings of the 20th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 701–710, 2014.

[37] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Lev-
skaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv
preprint arXiv:1906.05909, 2019.

[38] Andrea Rossi, Donatella Firmani, Antonio Matinata, Paolo Merialdo, and Denil-
son Barbosa. Knowledge graph embedding for link prediction: A comparative
analysis. arXiv preprint arXiv:2002.00819, 2020.

[39] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between
capsules. In Advances in neural information processing systems, pages 3856–3866,
2017.

[40] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared
Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter

language models using gpu model parallelism. arXiv preprint arXiv:1909.08053,
2019.

[41] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei.
Line: Large-scale information network embedding. In Proceedings of the 24th
international conference on world wide web, pages 1067–1077, 2015.

[42] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale atten-
tion for semantic segmentation. arXiv preprint arXiv:2005.10821, 2020.

[43] Kristina Toutanova and Danqi Chen. Observed versus latent features for knowl-
edge base and text inference. In Proceedings of the 3rd Workshop on Continuous
Vector Space Models and their Compositionality, pages 57–66, 2015.

[44] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume
Bouchard. Complex embeddings for simple link prediction. In Proceedings of the
33rd International Conference on Machine Learning (ICML), 2016.

[45] Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. The
anatomy of the facebook social graph. arXiv preprint arXiv:1111.4503, 2011.

[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.
In Advances in neural information processing systems, pages 5998–6008, 2017.

[47] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero,


Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint
arXiv:1710.10903, 2017.

[48] Thanh Vu, Tu Dinh Nguyen, Dat Quoc Nguyen, Dinh Phung, et al. A capsule
network-based embedding model for knowledge graph completion and search per-
sonalization. In Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), pages 2180–2189, 2019.

[49] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding.
In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 1225–1234, 2016.

[50] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov,
and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language under-
standing. In Advances in neural information processing systems, pages 5753–5763,
2019.

Appendix A. OPTIMAL HYPERPARAMETERS

In this section, we present the set of optimal hyperparameters used for both models:
the attention-based model and the ConvKB model. The hyperparameter optimization
process was conducted using grid search based on the Hits@10 evaluation metric. For
the attention-based model, we trained on the entire dataset without any data splitting.
In contrast, for the ConvKB model, we applied the same hyperparameter configuration
across all datasets. The detailed hyperparameters are shown in the following table:

| Dataset | µ | Weight decay | Epochs | Negative ratio | Dropout | α_LeakyReLU | N_head | D_final | γ |
|---|---|---|---|---|---|---|---|---|---|
| FB15k | 1e-3 | 1e-5 | 3000 | 2 | 0.3 | 0.2 | 2 | 200 | 1 |
| FB15k-237 | 1e-3 | 1e-5 | 3000 | 2 | 0.3 | 0.2 | 2 | 200 | 1 |
| WN18 | 1e-3 | 5e-6 | 3600 | 2 | 0.3 | 0.2 | 2 | 200 | 5 |
| WN18RR | 1e-3 | 5e-6 | 3600 | 2 | 0.3 | 0.2 | 2 | 200 | 5 |
