
Knowledge-Based Systems 255 (2022) 109597


A comprehensive overview of knowledge graph completion



Tong Shen, Fu Zhang ∗, Jingwei Cheng
School of Computer Science and Engineering, Northeastern University, Shenyang, 110169, China

article info

Article history:
Received 24 March 2022
Received in revised form 24 July 2022
Accepted 3 August 2022
Available online 12 August 2022

Keywords:
Knowledge Graph Completion (KGC)
Classification
Comparisons and analyses
Performance evaluation
Overview

abstract

Knowledge Graph (KG) provides high-quality structured knowledge for various downstream knowledge-aware tasks (such as recommendation and intelligent question answering) with its unique advantages in representing and managing massive knowledge. The quality and completeness of KGs largely determine the effectiveness of the downstream tasks. However, owing to the incompleteness of KGs, a large amount of valuable knowledge is still missing from them. It is therefore necessary to improve the existing KGs by supplementing the missing knowledge. Knowledge Graph Completion (KGC) is one of the popular technologies for such knowledge supplementation, and concern over KGC technologies has grown accordingly. Recently, there have been many studies focusing on the KGC field. To serve as a helpful resource for researchers to grasp the main ideas and results of KGC studies, and to highlight ongoing research in KGC, in this paper we provide an all-round, up-to-date overview of the current state of the art in KGC.
According to the information sources used in KGC methods, we divide the existing KGC methods into two main categories: the KGC methods relying on structural information and the KGC methods using other additional information. Further, each category is subdivided at different granularities for summarizing and comparing the methods. Besides, other KGC methods for KGs of special fields (including temporal KGC, commonsense KGC, and hyper-relational KGC) are also introduced. In particular, we discuss comparisons and analyses for each category in our overview. Finally, some discussions and directions for future research are provided.
© 2022 Elsevier B.V. All rights reserved.

∗ Corresponding author. E-mail address: [email protected] (F. Zhang).
https://doi.org/10.1016/j.knosys.2022.109597

1. Introduction

Knowledge Graphs (KGs) describe concepts, entities, and their relations in a structured triple form, providing a better ability to organize, manage, and understand the mass of information in the world [1]. In recent years, KGs have played an increasingly important role in many knowledge-aware tasks, and have especially brought vitality to intelligent question answering, information extraction, and other artificial intelligence tasks [1–3]. There are a number of large-scale KGs such as DBpedia [4], Freebase [5], WordNet [6], and YAGO [7] (as shown in Table 1), which have been widely exploited in many knowledge-aware applications. Facts in these KGs are generally represented in the form of a triple (subject, predicate, object), which can be regarded as the fundamental data structure of KGs and preserves their essential semantic information [8].

Although KGs are of great value in applications, they are still characterized by incompleteness, because a large amount of valuable knowledge exists only implicitly or is missing from the KGs [1]. Some data indicate that the deficiency rate of some common basic relations in current large KGs is more than 70% [9], while other less universal relations are even more lacking. Knowledge Graph Completion (KGC) aims to predict and replenish the missing parts of triples. As a popular KGC research direction, Knowledge Graph Embedding (KGE) (or Knowledge Graph Representation Learning) has been proposed and has quickly gained massive attention. KGE embeds KG components (e.g., entities and relations) into continuous vector spaces so as to simplify manipulation while preserving the inherent structure of the KG [10–15]. Recently, there have been many studies focusing on the KGC field. To facilitate research on the KGC task and follow the development of the field, more and more review articles have been published to sort out and summarize the recent KGC technologies.

Accordingly, several previous overviews of KGC techniques have been provided:
• Wang et al. [16] give the most relevant review with respect to KGC studies from 2012 to 2016. They first coarsely group KGE models according to their input data (facts only, or facts incorporating additional information; the additional information in [16] involves entity types, relation paths, textual descriptions, logical rules, and a brief mention of several other kinds of information, such as entity attributes and temporal information). Then they further make finer-grained categorizations based on the above


Table 1
Several famous KGs.
KG           Facts      Entities   Relations   Relying resource
DBpedia      538M       4.8M       2813        Wikipedia, Expertise
YAGO         447M       9.8M       114         WordNet, Wikipedia
Freebase     2400M      50M        37 781      Wikipedia, Expertise, Swarm-intelligence
NELL         0.4M       2M         425         Human-supplied
Wikidata     65M        15M        1673        Freebase, Swarm-intelligence
CN-DBPedia   222M       16M        –           Wikipedia, Expertise
Google KG    18 000M    570M       35 000      Freebase

grouping (e.g., the methods that only consider facts involve two categories: distance-based KGC methods and semantic matching-based KGC methods). However, the work [16] is not a specific overview of the KGC task; it just takes KGC as one of the downstream applications that KGE technologies can support.
• Gesese et al. [17] give a brief summary of KGC tasks that adds fewer than ten more recent articles compared with [16]. Moreover, the work [17] mainly focuses on KGE technology related to literal information. The literal information in [17] refers to text descriptions, numerical values, images, or their combinations.
• Rossi et al. [18] summarize 16 recent Link Prediction (LP) models based on KG embeddings. However, the work [18] does not cover other KGC tasks, such as Triple Classification (TC) and Relation Prediction (RP) (we give a specific introduction to these KGC tasks in Section 2.1).
• Also, two other overviews [19,20] briefly list and describe several KGC-related studies. They neither make a thorough and careful introduction to specific KGC technical details nor cover the major KGC approaches. In addition, several surveys [21–23] focus on the KG field but do not discuss specific works on KGC.

With the development of technologies such as Transformer and pre-trained language models (e.g., BERT) in the past few years, a large number of novel KGC techniques have appeared, which are either not covered or not summarized in detail in the existing surveys. Besides, beyond the information mentioned in [16,17], more kinds of additional information, such as entity neighbors, multi-hop relation paths, and third-party data sources, are now used in the KGC field. Intuitively, the KGC methods based on additional information should be divided into much wider scopes with more details.

Compared with the overviews above, in this paper we propose a more comprehensive and fine-grained overview of Knowledge Graph Completion (KGC). Our paper covers almost all of the mainstream KGC techniques up to now, and it provides a more careful classification for the different levels of KGC categories. In detail, we make the following main contributions:

(1) From the perspective of comprehensiveness, we provide a more comprehensive and systematic survey of the KGC field. We pay particular attention to the literature from 2017 to now, which is either not summarized in [16] or not introduced in detail in the other previous overviews [17,19,20], and [18]. Also, we consider some special KGC techniques, including Temporal Knowledge Graph Completion (TKGC), CommonSense Knowledge Graph Completion (CSKGC), and Hyper-relational Knowledge Graph Completion (HKGC).
(2) From the perspective of detailed classification and summarization, we summarize the recent KGC researches into two main categories depending on whether they rely on additional information of KGs: KGC merely with the structural information of KGs and KGC with additional information. For the former category, KGC methods are reviewed under three categories: tensor/matrix factorization models, translation models, and neural network models. For the latter category, we further divide it into two sub-categories: KGC methods based on the internal information inside KGs and KGC methods relying on extra information outside KGs. When we introduce the internal information-based KGC methods, we take account of five categories of information, including node literals, entity-related information, relation-related information, neighborhood information, and relational path information. Moreover, extra information-based KGC includes two families: rule-based KGC and KGC based on third-party data sources.
(3) From the perspective of comparison and analysis, for each KGC category we carry out a detailed comparison of the introduced KGC methods at diverse granularities, in both theory and experiment, together with a thorough analysis and summary. On this basis, we give a global discussion and prospect for the future research directions of KGC.

The remainder of the paper is structured as follows: we first give an overview of KG notations, definitions, the technological process, datasets, evaluation criteria, as well as our categorization criterion in Section 2; then we discuss the two categories of KGC methods relying on the structural information of KGs and using additional information in Section 3 and Section 4; next, our review turns to three special technologies of KGC in Section 5. In Section 6, we discuss outlook research directions. Finally, we conclude in Section 7.

2. Notations of knowledge graph completion and our categorization criterion

We first give some notations of KGC in Section 2.1. Then we introduce a general process of KGC (see Section 2.2), where several key steps of KGC are provided. Further, we summarize the main KGC datasets and evaluation criteria for KGC in Section 2.3. We also briefly introduce the knowledge graph refinement (KGR) technique, which is related to KGC (see Section 2.4). Finally, we give our categorization criterion (see Section 2.5).

2.1. Notations of KGC

To conveniently introduce various KGC models, this paper gives some notations of KGC as follows: we define a knowledge graph (KG) as G = (E, R, T), where E = (e1, e2, ..., e|E|) is the set of all entities contained in the KG, and the total number of entities is |E|. R = (r1, r2, ..., r|R|) represents the set of all relations in the KG, with count |R|. T ⊆ E × R × E represents the whole triple set in the KG. Each triple is represented as (h, r, t), where h and t denote the head entity and the tail entity, and r is the relation between them. During KGE, the entities h, t and relation r in the KG are mapped to continuous low-dimensional vectors v_h, v_r and v_t. We define the scoring function of a KGC model as s(h, r, t), which estimates the plausibility of any fact (h, r, t). In the training phase, we formally define the loss objective as L.
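To make these notations concrete, the following minimal Python sketch (our own illustration, not code from the paper; all names and the toy facts are assumptions) represents a KG as an entity set E, a relation set R, and a triple set T, together with a placeholder scoring function s(h, r, t):

from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    h: str   # head entity
    r: str   # relation
    t: str   # tail entity

# A toy KG G = (E, R, T)
T = {Triple("Bill Gates", "gender", "male"),
     Triple("Bill Gates", "founderOf", "Microsoft")}
E = {x for tr in T for x in (tr.h, tr.t)}   # entity set
R = {tr.r for tr in T}                      # relation set

def score(triple: Triple) -> float:
    """Placeholder for a scoring function s(h, r, t); a real KGC model would
    compute this from learned embeddings v_h, v_r, v_t."""
    return 1.0 if triple in T else 0.0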

KGC can be divided into three subtasks: triple classification, link prediction, and relation prediction. Triple classification is an important task in KGC which determines whether to add a triple to the KG by estimating whether this triple is true or not. The link prediction task refers to the process of finding the missing entity when the head entity or tail entity in a triple is missing. Relation prediction judges the probability of establishing specific relations between two entities. The three subtasks of KGC can be formulated as follows:

Triple classification (TC): Given a triple (ei, rk, ej), the goal is to determine whether the current triple is true.

Link prediction (LP): Given a triple (?, rk, ej) or (ei, rk, ?), the goal is to predict the missing head entity or tail entity "?".

Relation prediction (RP): Given the partial triple (ei, ?, ej), the goal is to predict the missing relations between ei and ej.

2.2. An overview of the KGC process

In this part, we give a general introduction to the whole technological process of KGC in Section 2.2.1. In addition, we describe two training techniques, negative sampling and ranking setting, in Section 2.2.2. Fig. 1 illustrates the typical workflow of a KGC process.

Fig. 1. The general KGC process.

2.2.1. General KGC procedure
As can be seen in Fig. 1, a KGC process generally involves three parts: model learning, candidate processing, and facts identification.
• Model learning. First, before building a KGC model, there is usually a pre-processing step in charge of data preparation, which includes negative sampling (this is not always a necessary step, since it can also be done online during model training) and dataset splitting. Negative sampling aims to add a variable amount of negative examples to the original KG in response to the problem that KGs only contain positive examples [24]. Dataset splitting is responsible for splitting the pending KG data into a training set, a validation set, and a testing set. The split datasets are then used to train and evaluate the KGC model. The KGC model is usually a classification model or a ranking model, whose target is to predict whether a candidate triple is correct or not for a KG. Generally, the learned KGC model undergoes an evaluation process and is assessed through a variety of evaluation metrics. A satisfactory assessment result usually indicates a good KGC model.
• Candidate processing. Candidate processing aims to obtain verifiable triples, which will be checked by the KGC model learned in model learning. Candidate processing starts with candidate set generation, which produces a candidate set (the set of triples that are possibly correct but are not present in the KG) relying on algorithms or manual work. Since the initially generated candidate set tends to be very large regardless of whether the candidates are promising or not, it subsequently has to go through a candidate filtering [25] step to preemptively remove those unlikely candidates while keeping as many promising candidates as possible. Usually, the filtering work is accomplished by generating several filtering rules (also known as "pruning strategies") and applying these rules to the candidate set to produce a condensed set of the most promising candidates [26].
• Facts identification. Finally, the KGC model learned in model learning is applied to the above set of promising candidates generated by candidate processing, resulting in the set of missing triples that are considered correct and are likely to be added into the KG [26].

2.2.2. Two training techniques: Negative sampling and ranking setting
(1) Negative sampling
Basic idea of negative sampling. The existing triples in a given KG are all correct triples, i.e., (h, r, t) ∈ T, where T denotes the positive triple set. Since a KGC model needs to be trained and verified with the help of negative triples, it is necessary to perform negative sampling, i.e., to construct negative triples and build a negative triple set T′. In general, negative sampling randomly replaces one entity or relation in a correct triple (there are two options in practice: replace only entity elements, or replace both entities and relations) to make it an incorrect triple. For example, for the triple (Bill Gates, gender, male), when we replace "Bill Gates" with another random entity in the KG, such as "Italy", the triple (Italy, gender, male) is formed, and it is a negative triple (whose label is "false"). However, sometimes the triples formed after random replacement are still true. For example, if the head entity in the above example is randomly replaced with another entity "Steve Jobs" in the KG, the triple becomes (Steve Jobs, gender, male) and is still valid. In this situation, we normally consider filtering out this kind of "negative triple" from T′.
Sampling strategy. We introduce three kinds of common sampling strategies: uniform sampling ("unif"), the Bernoulli negative sampling method ("bern") [15], and generative adversarial network (GAN)-based negative sampling [27]; the first two are illustrated in the code sketch after this list.
• Uniform sampling ("unif") is a comparatively simple sampling strategy, which samples negative triples according to a uniform distribution. In this way, all entities (or relations) are sampled with the same probability.
• Bernoulli negative sampling ("bern") [15]: due to the unbalanced distribution of the number of head entities and tail entities corresponding to a certain relation, i.e., the existence of multiple types of relations including "one to many", "many to one", and "many to many" relations, it is not reasonable to replace the head entities or the tail entities in a uniform manner. Therefore, the "bern" strategy [15] replaces the head entity or the tail entity of a triple with different probabilities. Formally, for a certain relation r, "bern" counts the average number of head entities per tail entity (denoted as hpt) and the average number of tail entities per head entity (denoted as tph) over all triples with the relation r, and then it replaces the head entity with probability tph/(tph + hpt) and, similarly, the tail entity with probability hpt/(tph + hpt). The "bern" sampling technique performs well in many tasks and can reduce false-negative labels compared with "unif".
• GAN-based negative sampling [27]: Inspired by the wide application of generative adversarial networks (GAN) [27], Cai et al. [28] generate negative samples with GAN-based sampling in a reinforcement learning manner, in which the GAN generator is responsible for generating negative samples, while the discriminator can use translation models to obtain the vector representations of entities and relations; it then scores the generated negative triples and feeds related information back to the generator to provide experience for its negative sample generation. Recently, a series of negative sampling techniques based on GAN have appeared (e.g., [29–31]); relevant experiments have shown that this kind of method can obtain high-quality negative samples, which are conducive to classifying triples correctly during the training of a knowledge representation model.
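As a concrete illustration of the "unif" and "bern" strategies above, the following Python sketch (a minimal example of our own, not code from the surveyed systems; function and variable names are ours) computes the per-relation probability tph/(tph + hpt) and corrupts a triple accordingly, resampling when the corrupted triple is an accidental false negative:

import random
from collections import defaultdict

def bern_probabilities(triples):
    """Per-relation probability of corrupting the head, tph / (tph + hpt)."""
    heads_per_tail = defaultdict(set)   # (r, t) -> set of heads
    tails_per_head = defaultdict(set)   # (r, h) -> set of tails
    relations = set()
    for h, r, t in triples:
        relations.add(r)
        heads_per_tail[(r, t)].add(h)
        tails_per_head[(r, h)].add(t)
    prob = {}
    for r in relations:
        tph_vals = [len(ts) for (rel, _), ts in tails_per_head.items() if rel == r]
        hpt_vals = [len(hs) for (rel, _), hs in heads_per_tail.items() if rel == r]
        tph = sum(tph_vals) / len(tph_vals)
        hpt = sum(hpt_vals) / len(hpt_vals)
        prob[r] = tph / (tph + hpt)
    return prob

def corrupt(triple, entities, positives, head_prob=None):
    """Return one negative triple: 'unif' if head_prob is None, else 'bern'.
    Resamples if the corrupted triple already exists in the KG (false negative)."""
    h, r, t = triple
    p = 0.5 if head_prob is None else head_prob[r]
    while True:
        e = random.choice(entities)
        cand = (e, r, t) if random.random() < p else (h, r, e)
        if cand not in positives:
            return cand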

Table 2
Common KGC benchmarks and their attributes.
Benchmark   Entities   Relations   #Training   #Validation   #Test
WN11        38 696     11          112 581     2 609         10 544
WN18RR      40 493     11          86 835      3 034         3 134
FB13        75 043     13          316 232     5 908         23 733
FB15k       14 951     1 345       483 142     50 000        59 071
FB15k-237   14 541     237         272 115     17 535        20 466

(2) Ranking setting
In the link prediction (LP) task, evaluation is carried out by performing head prediction or tail prediction on all test triples, and by computing for each prediction how the target entity ranks against all the other ones. Generally, the model expects the target entity to obtain the highest plausibility. When computing the predicted ranks, two different settings, the raw and filtered scenarios, are applied. In fact, a prediction may have more than one valid answer: taking tail prediction for (Barack Obama, parent, Natasha Obama) as an example, a KGC model may assign a higher score to Malia Obama than to Natasha Obama, i.e., there may exist another predicted fact that is already contained in the KG, e.g., (Barack Obama, parent, Malia Obama). Depending on whether such valid answers should be considered acceptable or not, two separate settings have been devised [18]:
• Raw setting: in this case, valid entities outscoring the target one are considered as mistakes [18]. Thus, for a test fact (h, r, t) in the testing set, the raw rank rank_h of the target head entity h is computed as follows (analogously for the tail entity):
rank_h = |{e ∈ E \ {h} : s(e, r, t) > s(h, r, t)}| + 1
• Filtered setting: in this case, valid entities outscoring the target one are not considered as mistakes [18]; they are filtered out when computing the rank. For the test fact (h, r, t), the filtered rank rank_h of the target head entity h is computed as follows (analogously for the tail entity):
rank_h = |{e ∈ E \ {h} : s(e, r, t) > s(h, r, t) ∧ (e, r, t) ∉ T}| + 1

2.3. Datasets and evaluation metrics

Here we introduce some of the most frequently used datasets for KGC (see Section 2.3.1) and several evaluation metrics for KGC (see Section 2.3.2).

2.3.1. Datasets
We describe the datasets mainly developed from two KGs, Freebase and WordNet, and report some of their important attributes in Table 2.
• Freebase: Freebase is a public KG whose content is added entirely by users. Moreover, Freebase also extracts knowledge from open KGs as a supplement [26]. The fundamental data items in Freebase include "Topic", "Type", "Domain", "Property", and so on. We give a demonstration of the data in Freebase in Fig. 2. The topic Miyazaki Hayao is a cartoonist in the cartoon domain but a director in the movie domain. It can be seen that Freebase is a database consisting of multiple domains expanded by topics, and the graph structure of every topic is controlled by its type and type properties. Typically, the subsets FB15k and FB13 of Freebase, as well as the improved FB15k-237 based on FB15k, are generally used as experimental benchmarks for method evaluation in KGC:

Fig. 2. The data example in Freebase.

(1) FB15k: FB15k is created by selecting the subset of entities that are also involved in the Wikilinks database and that possess at least 100 mentions in Freebase [11]. In addition, FB15k removes reversed relations (where a reversed relation like '!/people/person/nationality' just reverses the head and tail compared to the relation '/people/person/nationality'). FB15k describes ternary relationships between synonymous sets, and the synonym sets that appear in the validation set and testing set also appear in the training set. Also, FB15k converts n-ary relations represented with reification into cliques of binary edges, which greatly affects the graph structure and semantics [18]. FB15k has 592,213 triples with 14,951 entities and 1,345 relations, which were randomly split as shown in Table 2.
(2) FB15k-237 is a subset of FB15k built by Toutanova and Chen [32], which arose in response to the test leakage problem that FB15k suffers from, caused by the presence of near-identical or reversed relations. Against this background, FB15k-237 was built to be a more challenging dataset by first selecting facts from FB15k involving the 401 largest relations and removing all equivalent or reverse relations. Then, the authors ensured that none of the entities connected in the training set are also directly linked in the validation and testing sets, in order to filter away all trivial triples [18].
• WordNet [6]: WordNet is a large cognitive-linguistics-based KG ontology, which can also be regarded as an English dictionary knowledge base; its construction process considers the alphabetic order of words and further forms a semantic web of English words. In WordNet, entities (called synsets) correspond to word senses, and relation types define the lexical relations between these senses. Besides, WordNet not only covers multiple lexical phenomena such as polysemy, category classification, synonymy, and antonymy, but also includes entity descriptions. Furthermore, there are various post-produced subset datasets extracted from WordNet, such as WN11, WN18, and WN18RR:
(1) WN11: it includes 11 relations and 38 696 entities. The training set, the validation set, and the test set of WN11 contain 112 581, 2 609, and 10 544 triples, respectively [11].
(2) WN18: it uses WordNet as a starting point and then iteratively filters out entities and relationships with too few mentions [11,18]. Note that WN18 involves reversible relations.
(3) WN18RR: WN18RR is built by Dettmers et al. [33] to relieve the test leakage issue in WN18, i.e., test data being seen by models at training time. It is constructed by applying a pipeline similar to the one employed for FB15k-237 [32]. Recently, the authors acknowledged that 212 entities in the testing set do not appear in the training set, making it impossible to reasonably predict about 6.7% of the test facts.

Table 3
Detailed computing formulas of evaluation metrics for KGC.
Metric      Computing formula                                        Notation definition                                                  Task
MRR         MRR = (1/|Q|) Σ_{i=1}^{|Q|} 1/rank_i                     Q: query set; |Q|: number of queries;                                LP, RP
                                                                     rank_i: rank of the first correct answer for the ith query
MR          MR = (1/|Q|) Σ_{i=1}^{|Q|} rank_i                        Q: query set; |Q|: number of queries;                                LP, RP
                                                                     rank_i: rank of the first correct answer for the ith query
Hits@n      Hits@n = (1/|Q|) Count(rank_i ≤ n), 0 < i ≤ |Q|          Count(): number of test queries whose correct answer ranks           LP, RP
                                                                     in the top n; Q, |Q|, rank_i as above
MAP         MAP = (1/|Q|) Σ_{q∈Q} AP_q                               AP_q: average precision of query q; Q, |Q| as above                  LP, RP
Accuracy    Accuracy = (TP + TN) / (TP + TN + FP + FN)               TP: true positive; FP: false positive;                               TC
                                                                     FN: false negative; TN: true negative
Precision   Precision = TP / (TP + FP)                               TP: true positive; FP: false positive                                TC
Recall      Recall = TP / (TP + FN)                                  TP: true positive; FN: false negative                                TC
F1 score    F1 = 2 · Recall · Precision / (Recall + Precision)       —                                                                    TC

2.3.2. Evaluation metrics
In this section, we describe the evaluation metrics generally used in KGC. Table 3 shows the detailed computing formulas of these metrics.

Mean Reciprocal Rank (MRR): MRR is widely used in ranking problems that return multiple results, such as the LP and RP tasks in KGC. When dealing with such problems, the evaluation system ranks the results by their scores from high to low. MRR evaluates a ranking algorithm according to the rank of the target answer: the higher the target answer ranks, the better the ranking algorithm. Formally, for a query, if the target answer ranks nth, then the score of this query is 1/n (if there is no target answer among the returned results, the score is 0).

Mean Rank (MR) and Hits@n: Similar to MRR and generally used in top-K ranking problems, MR and Hits@n are common metrics in KGC evaluation, especially in the LP and RP tasks. MR represents the average rank of the target entities (or relations) in the testing set; Hits@n (usually n = 1, 3, 10) indicates the proportion of test cases in which the predicted target entity (or relation) ranks in the top n. The ranks are computed according to each prediction's score.

Accuracy: Accuracy refers to the ratio of correctly predicted triples to the total predicted triples; it is usually applied to evaluate the quality of classification models in the TC task of KGC, and its calculation formula is given in Table 3.

Other evaluation metrics: There are other evaluation metrics for KGC tasks, such as Mean Average Precision (MAP), which pays attention to the relevance of the returned results in ranking problems, and metrics closely related to "accuracy" for measuring classification problems, like "recall", "precision", and "F1 score". Compared with MR, MRR, Hits@n, and "accuracy", these metrics are not as frequently employed in the field of KGC. The detailed computing formulas of all these metrics can be found in Table 3.

2.4. Knowledge Graph Refinement (KGR) vs. KGC

The construction process of large-scale KGs means that the formalized knowledge in KGs cannot reasonably reach both "full coverage" and "full correctness" simultaneously. KGs usually need a good trade-off between completeness and correctness. Knowledge Graph Refinement (KGR) is proposed to infer and add missing knowledge to the graph (i.e., KGC) and to identify erroneous pieces of information (i.e., error detection) [24]. Recently, KGR has been incorporated into recommender systems [34]. Tu et al. [34] exploit the KG to capture target-specific knowledge relationships in recommender systems by distilling the KG to reserve the useful information and refining the knowledge to capture the users' preferences.
Basically, KGC is one of the KGR subtasks, conducting inference and prediction of missing triples. Error detection (e.g., [35,36]) is another KGR subtask, which identifies errors in KGs. Jia et al. [36] establish a knowledge graph triple trustworthiness measurement model that quantifies the semantic correctness of triples and the degree to which they express true facts. Note, however, that KGC is a relatively independent task that increases the coverage of KGs to alleviate their incompleteness. In the current overview, we focus on the KGC techniques; for issues about KGR, we refer the reader to [24,34].
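To make the LP evaluation protocol of Sections 2.2.2 and 2.3.2 concrete, the following minimal sketch (our own illustration; function and variable names are ours, and the scoring function is passed in as a callable) computes a filtered rank and aggregates a list of ranks into MR, MRR, and Hits@n as defined in Table 3:

def filtered_rank(target, candidates, score, known_true):
    """Rank of `target` among `candidates`, skipping other known-true answers
    (filtered setting); the raw setting would simply not skip them."""
    s_target = score(target)
    rank = 1
    for c in candidates:
        if c == target or c in known_true:
            continue
        if score(c) > s_target:
            rank += 1
    return rank

def lp_metrics(ranks, ns=(1, 3, 10)):
    """Aggregate a list of ranks into MR, MRR and Hits@n."""
    q = len(ranks)
    metrics = {
        "MR": sum(ranks) / q,
        "MRR": sum(1.0 / r for r in ranks) / q,
    }
    for n in ns:
        metrics[f"Hits@{n}"] = sum(1 for r in ranks if r <= n) / q
    return metrics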

2.5. Our categorization principle

The main full-view categorization of our review of KGC studies is shown in Fig. 3.

Fig. 3. Our Categorization Principle.

To follow the rapid development of KGC models, we provide wide coverage of emerging research on advanced KGC technologies. We include the main literature since the beginning of KGC research as comprehensively as possible and treat the far-reaching and remarkable approaches in detail. We divide KGC methods into two main categories according to whether they use additional information: Structure (triple) information-based KGC methods and Additional information-based KGC methods (the additional information typically refers to other information included inside or outside of KGs besides the structure information, such as text descriptions or artificial rules). Moreover, we further consider the source of the additional information: depending on whether it comes from inside the KG, we classify the additional information into two finer subclasses, internal side information inside KGs and external extra information outside KGs. In addition, we introduce some KGC techniques targeting certain fields, like Temporal Knowledge Graph Completion (TKGC), CommonSense KGC (CSKGC) and Hyper-relational KGC (HKGC). We also make a detailed comparison and summary among the methods of each small category, and we give a global discussion and prospect for the future research directions of KGC. Specifically, our categorization principle is as follows:
• Structure information-based KGC methods: these only use the structure information of the internal facts in KGs. For this category, KGC is reviewed under semantic matching models and translation models according to the nature of their scoring functions. The semantic matching models generally use semantic matching-based scoring functions and further consist of tensor/matrix factorization models and neural network models. The translation models apply distance-based scoring functions;
• Additional information-based KGC methods: these cooperate with additional information (the inside or outside information of KGs other than the structure information) to achieve KGC. For this category, we further propose fine-grained taxonomies from two views, according to the usage of inside or outside information:
(1) Internal side information inside KGs, including node attribute information, entity-related information, relation-related information, neighborhood information, and relational path information;
(2) External extra information outside KGs, mainly including two aspects: rule-based KGC and third-party data source-based KGC.
• Other KGC technologies: we pay additional attention to some other KGC techniques, such as Temporal KGC, CommonSense KGC, and Hyper-relational KGC.

3. Structural information-based KGC technologies

In this section, we focus on KGC technologies relying on structure information only, and give an account of the categories of methods belonging to this kind of KGC technology: semantic matching models in Section 3.1 and translation models in Section 3.2.

3.1. Semantic matching models

Semantic matching models are a kind of model that computes semantic matching-based scoring functions by measuring the semantic similarities of entity or relation embeddings in a latent embedding space. In this category, we introduce two subclasses: Tensor/Matrix Factorization Models (see Section 3.1.1) and Neural Network Models (see Section 3.1.2).

Table 4
Characteristics of Tensor Factorization (TF) KGC methods.
Model Highlight Score function Loss functiona Parameters & Constrains
Tucker-based TF methods
TuckER [37] Tucker decomposition, s(h, r , t) = W ×1 vh ×2 vr ×3 vt Bernoulli Llog W ∈ Rde ×dr ×de ,
multi-task learning vh , vt ∈ Rde , vr ∈ Rdr
DEDICOM-based TF methods
RESCAL [13] Three-way bilinear TF s(h, r , t) = vhT Mr vt L2 vh , vt ∈ Rd , Mr ∈ Rd×d
LFM [38] Bilinear TF, s(h, r , t) ≜ Llog Rj ∈ Rp×p ,
decomposing the yT Mr y′ + vt T Mr z + z ′T Mr vt + vt T Mr vt , y, y′ , z , z ′ ∈ Rp
j
ur , vr ∈ Rp , α j ∈ Rd
∑d
r =1 αr Θr = ur vr
relation matrix Rj , T
Rj =
decreasing parameters
of RESCAL
Tatec [39] 2-way and 3-way s(h, r , t) = s1 (h, r , t) + s2 (h, r , t) Lmarg vhi , vti ∈ Rdi , i = 1, 2
interactions models, s1 (h, r , t) = vrT1 vh1 + vrT2 vt1 + vhT Ddiag vt1 +∆soft /hard vr1 , vr2 ∈ Rd1 ,
1
hard regularization, s2 (h, r , t) = vhT M r vt2 M r ∈ R d2 × d2
2
soft regularization
ANALOGY [40] Bilinear TF, s(h, r , t) = vhT Mr vt Llogistic h, t ∈ Rd , Mr ∈ Rd×d
normality relation matrix Mr MrT = MrT Mr , ∀r ∈ R,
commutativity relation matrix Mr Mr ′ = Mr ′ Mr , ∀r ∈ R.
REST [41] Subgraph tensors building, for quary (h, r , ?) : ve = vhT Mr A L2 vh , vt ∈ Rd , Mr ∈ Rd×d
RW-based SGS, s(h, r , t) = vhT Mr vt A ∈ RNe ×d
predicate sparsification operator,
Focused Link Prediction (FLP)
CP-based TF methods
DistMult [42] RESACL + diagonal matrices s(h, r , t) = vhT Mr diag vt max Lmarg Mr diag = diag(r), r ∈ Rd
ComplEx [43] Complex values s(h, r , t) = Re(v T

h Mr diag t ) Lnll + ∆L2 vh , vt ∈ Cd ,
Mr diag = diag(vr ), vr ∈ Cd
CP-based TF model
∑d−1
= Re( i=0 [vr ]i · [vh ]i · [v¯t ]i )

SimplE [44] Bilinear TF model, s(h, r , t) =


1
(sCP (h, r , t) + sCP (h, r −1 , t)) Lnll + ∆L2 vh , vt ∈ Rd , vr ∈ Rd
utilizing inverse relations,
∑d−12
sCP = i=0 [vr ]i · [vh ]i · [vt ]i
fully expressive-evaluation metric
DrWT [45] Fine-grained types inference, s(E , F , G, H) = χ L2 chi ∈ RS ×O×P ×D ,
domain knowledge modeling, = Cdiag ×s E ×p F ×o G ×d H Cdiag ∈ Rd×d×d×d ,
leverages additional E ∈ RS ×d , F ∈ RO×d
data outside KG, G ∈ RP ×d , H ∈ RD×d
4th-order TF
∑d−1 1 1 3
TriVec [46] ComplEx with three s(h, r , t) = i=0 ([vh ]i [vr ]i [vt ]i Lls + ∆N3 vh , vt ∈ Cd , vr ∈ Cd
components score function, +[vh2 ]i [vr2 ]i [vt2 ]i + [vh3 ]i [vr3 ]i [vt1 ]i )
three parts entity/relation
-representation
Additional training technologies
Ensemble Reproduces DistMult, s(h, r , t) = vhT · Mr diag · vt max Lmarg Mr diag = diag(r), r ∈ Rd
exp(s(h,r ,t))
DistMult [47] parameter adjustment, s′ (h, r , t) = P(t |h, t) = ∑
t̄ ∈ϵh,t exp(s(h,r ,t))
fine tuning technology
Regularizer R1 multiplicative- s(h, r , t) = Re(v T

h Mr diag t ) Lnll Mr diag = diag(vr ), vr ∈ Cd ,
-Enhanced Model L1 regularizer d−1 +∆R1 mL1 R1 (Θ ) =
[vr ]i · [vh ]i · [v¯t ]i )

[48] = Re( ∑ d∑
−1
i=0 |Re([vr ]i ) · Im([vr ]i )|,
r ∈R i=0
R2 (Θ ) = ∥Θ ∥22
Constraints NNE constraints, s(h, r , t) = Re(vhT Mr diag v¯t ) Lnll + ∆L2 Mr diag = diag(r), r ∈ Cd ;
-enhanced Model AER constraintsb 0 ≤ Re(e), Im(e) ≤ 1;
∑d−1
= Re( i=0 [vr ]i · [vh ]i · [v¯t ]i )
[49] s(ei , rp , ej ) ≤ s(ei , rq , ej ),
∀ e, ei , ej ∈ E

(continued on next page)

3.1.1. Tensor/matrix factorization models
Here we introduce a series of Tensor Factorization (TF) models in detail and provide a summary table (Table 4) to conveniently exhibit the characteristics of these models. Recently, tensors and their decompositions have been widely used in data mining and machine learning problems [13]. In the KG field, large-scale tensor factorization has received more and more attention for KGC tasks.
Based on the fact that a KG can be represented as a tensor (as shown in Fig. 4), KGC can be framed as a third-order binary tensor completion problem; tensors can also be regarded as a general method to replace common methods such as graphical models [50]. For KGC, the relational data can be represented as a {0, 1}-valued third-order tensor Y ∈ {0, 1}^{|E|×|R|×|E|}, where Y_{h,r,t} = 1 if the relation (h, r, t) holds, and the three modes stand for the subject mode, the predicate mode, and the object mode respectively. TF algorithms aim to infer a predicted tensor X ∈ R^{|E|×|R|×|E|} that approximates Y in some sense. Validation/test queries (?, r, t) are generally answered by ordering candidate entities h′ by decreasing values of X_{h′,r,t}, and queries (h, r, ?) are answered by ordering entities t′ by decreasing values of X_{h,r,t′}. In this context, numerous works have considered link prediction as a low-rank tensor decomposition problem.
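As a minimal illustration of this tensor view (our own sketch; the index layout and names are assumptions, not the paper's code), the snippet below builds the binary tensor Y from integer-indexed triples and answers a (h, r, ?) query by sorting the scores of a predicted tensor X:

import numpy as np

def build_binary_tensor(triples, n_entities, n_relations):
    """Y[h, r, t] = 1 iff (h, r, t) is a known fact (indices are integer ids)."""
    Y = np.zeros((n_entities, n_relations, n_entities), dtype=np.int8)
    for h, r, t in triples:
        Y[h, r, t] = 1
    return Y

def answer_tail_query(X, h, r, k=10):
    """Answer (h, r, ?) by ranking all entities t' by decreasing X[h, r, t']."""
    scores = X[h, r, :]
    return np.argsort(-scores)[:k]       # ids of the k most plausible tails

# toy usage: X here is a random stand-in for the tensor reconstructed by a TF model
Y = build_binary_tensor([(0, 0, 1), (1, 1, 2)], n_entities=3, n_relations=2)
X = np.random.rand(3, 2, 3)
print(answer_tail_query(X, h=0, r=0, k=3))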

Table 4 (continued).
Model Highlight Score function Loss functiona Parameters & Constrains
∑d−1
N3 regularizer [50] CP + p-norms regularizer s(h, r , t) = sCP = i=0 [vr ]i · [vh ]i · [vt ]i Lnll + ∆N3 vh , vt ∈ Rd ,∑
vr ∈ R∑
d

Ωpα (v ) = 31 Rr=1 3d=1 ∥vr ∥αp


(d)

s(h, r , t) = χ = vhi(b) ⊗ vti(b) ⊗ vri(b) chi ∈ {0, 1}Ne ×Ne ×Nr



B-CP [51] CP + binary value parameters, i∈[d] LCE
Bitwise Operations v(b)
= Q△ (vhi ), v (b)
= Q△ (vti ), vhd , vtd ∈ {+△, −△}d
hi ti
(b) vrd ∈ {+△, −△}d
vri = Q△ (vri )
+△ if x ≥ 0,
{
Q△ (x) = △sign(x) =
−△ if x < 0
QuatE [52] ComplEx in s(h, r , t) = Qhrotation · Qt , Lnll + ∆L2 Q ∈ HNe ×d , W ∈ HNr ×d ;
hyper-complex space Qx = {ax + bx i + cx j + dx k}, x = h, t ;
Wr = {ar + br i + cr j + dr k}, ah , bh , ch , dh ∈ Rd ;
Qhrotation = Qh ⊗ Wr◁ , at , bt , ct , dt ∈ Rd ;
Wr◁ = Wr ar , br , cr , dr ∈ Rd
|Wr |

JoBi [53] Joint learning: JoBi ComplEx: Lnll vh , vt , vr ∈ Rd


bilinear TF model + sbi (h, r , t) = Re(vhT diag(rbi )v¯t ),
auxiliary model (using entity stri (h, r , t) = Re(vhT diag(rtri )v¯t )
-relation co-occurrence pairs)
Linear & Quadratic ‘Linear + Regularized’, s(h, r , t) = vhT Mr vt Lquad + C /Rc vh , vt ∈ Rd , Mr ∈ Rd×d
Model [54] ‘Quadratic + Regularized’,
‘Quadratic + Constraint’
‘Linear + Constraint’
a
Lll (Lnll ), Lls , L2 , Lquad , Lmarg and LCE are (negative) log likely-hood loss, log softmax loss, L2 loss, quadratic loss, margin-based ranking loss and cross entropy loss
respectively, and ∆ indicates the regularization terms in loss function.
b
‘NNE’ and ‘AER’ represents non-negativity constraints and approximate entailment constraints.
c
‘C/R’ means Constraints and Regularations in [54].

Fig. 4. Knowledge Graph as Tensors [41].

3.1.1.1. Tucker-based TF methods. The well-known TF approach Tucker [55] decomposes the original tensor χ ∈ R^{N1×N2×N3} into three matrices A ∈ R^{N1×M1}, B ∈ R^{N2×M2}, C ∈ R^{N3×M3} and a smaller core tensor Z ∈ R^{M1×M2×M3}, specifically in the form
χ ≈ Z ×1 A ×2 B ×3 C,
where ×n denotes the tensor product along the nth mode. The factor matrices A, B, and C can be considered as the principal components in each mode if they are orthogonal. Typically, since M1, M2, M3 are smaller than N1, N2, N3 respectively, Z can be regarded as a compressed version of χ, whose elements express the level of interaction between the components.

Fig. 5. Visualization of the TuckER architecture [37].

TuckER [37] applies the Tucker decomposition to the binary tensor representation of a KG. It is a powerful linear model with fewer parameters that obtains consistently good results, because it enables multi-task learning across relations. By modeling the binary tensor representation of a KG according to the Tucker decomposition as in Fig. 5, TuckER defines the score function as
s(h, r, t) = W ×1 v_h ×2 v_r ×3 v_t,
where W ∈ R^{de×dr×de} is the core tensor, and v_h, v_t ∈ R^{de} and v_r ∈ R^{dr} represent the head entity embedding, tail entity embedding, and relation embedding respectively. It is worth noting that TuckER can derive sufficient bounds on its embedding dimensionality, and there is adequate evidence that several linear models (such as RESCAL [13] and DistMult [42], which will be mentioned later) can be viewed as special cases of TuckER.
However, Kolda and Bader [56] indicated that the Tucker decomposition is not unique, because the core tensor Z can be transformed without affecting the fit if the inverse transformation is applied to A, B, and C.
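A TuckER-style score can be written compactly with einsum; the sketch below (our own illustration with made-up toy dimensions and random stand-ins for learned parameters) computes s(h, r, t) = W ×1 v_h ×2 v_r ×3 v_t for a single triple:

import numpy as np

de, dr = 4, 3                                   # toy embedding sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(de, dr, de))               # core tensor
E = rng.normal(size=(10, de))                   # 10 entity embeddings
R = rng.normal(size=(5, dr))                    # 5 relation embeddings

def tucker_score(h, r, t):
    """s(h, r, t): contraction of the core tensor with v_h, v_r, v_t."""
    return np.einsum('abc,a,b,c->', W, E[h], R[r], E[t])

print(tucker_score(0, 1, 2))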

3.1.1.2. Decomposition into directional components (DEDICOM)-based TF methods. In contrast to the Tucker decomposition, the rank-r DEDICOM decomposition [57] is capable of detecting correlations between multiple interconnected nodes, which can be captured by singly or jointly considering the attributes, relations, and classes of related entities during a learning process. DEDICOM decomposes a three-way tensor χ as
χ_k ≈ A D_k R D_k A^T,  for k = 1, ..., m,
where the matrix A ∈ R^{n×r} holds the latent components, the asymmetric matrix R ∈ R^{r×r} reflects the global interactions between the latent components, and the diagonal matrix D_k ∈ R^{r×r} models the participation of the latent components in the kth predicate. Under this formulation, DEDICOM is suitable for cases where there exists a global interaction model for the latent components and its variation in the third mode can be described by diagonal factors [13].

Fig. 6. The three-way tensor model for relational data [13].

RESCAL [13] is an early three-way DEDICOM-based model for KGC, which interprets the inherent structure of dyadic relational data. It employs a three-way tensor χ (as shown in Fig. 6), where two modes are identically formed by the concatenated entity vectors of the domain and the third mode holds the relation matrices of the domain. The score of a fact (h, r, t) is defined by a bilinear function:
s(h, r, t) = v_h^T M_r v_t,    (1)
where v_h, v_t ∈ R^d are entity embeddings and M_r ∈ R^{d×d} is an asymmetric matrix associated with the relation that models the interactions between latent factors.
LFM [38] is a bilinear TF model extending RESCAL, designed to cope with the growth of relational data and thus to model large multi-relational datasets. Similar to RESCAL, LFM embeds entities as d-dimensional vectors and encodes each relation into a matrix M_{r_j} acting as a bilinear operator among the entities, where 1 ≤ j ≤ N_r and M_r ∈ R^{d×d}. To efficiently model large relational data, LFM first redefines the score in the following form, which takes into account the different interaction orders (unigram, bigram, and trigram) between h, t and r:
s(h, r, t) ≜ y^T M_r y′ + v_h^T M_r z + z′^T M_r v_t + v_h^T M_r v_t,    (2)
where the parameters y, y′, z, z′ ∈ R^d; the terms y^T M_r y′, v_h^T M_r z + z′^T M_r v_t, and v_h^T M_r v_t represent the uni-, bi- and trigram orders of interaction between h, t and r. The other improvement on RESCAL is decomposing the relation matrix M_r over a set of p rank-one matrices Θ_j (1 ≤ j ≤ p):
M_r = Σ_{j=1}^{p} α_r^j Θ_j,    (3)
where Θ_j = u_j w_j^T for u_j, w_j ∈ R^d and α_r ∈ R^p. The matrices Θ_j, constrained by the outer product, efficiently decrease the number of overall parameters compared with the general relation matrix parameterization in RESCAL, which greatly speeds up the computations relying on traditional linear algebra. LFM learns the terms appearing in formulas (2) and (3) by minimizing the negative log-likelihood over a specific constraint set.
Tatec [39] cooperates with both 2-way and 3-way interaction models to capture different data patterns in their respective embedding spaces, and it obtains performance outstripping the best of either constituent. Different from its closest relative, LFM, Tatec combines the 3-way model with a constrained 2-way model but pre-trains them separately. Tatec learns distinct embeddings and relation parameters for the 2-way and 3-way interaction terms, so that it avoids the reduced expressiveness of the 2-way interaction terms caused by joint parameterization. The combined score function of Tatec is
s(h, r, t) = s1(h, r, t) + s2(h, r, t),
where s1() and s2() correspond to the 2-way and 3-way terms, in the following forms:
s1(h, r, t) = v_{r1}^T v_{h1} + v_{r2}^T v_{t1} + v_{h1}^T D v_{t1}
s2(h, r, t) = v_{h2}^T M_r v_{t2},
where v_{hi}, v_{ti} are embeddings of the head and tail entities in the R^{di} space (i = 1, 2), v_{r1}, v_{r2} are vectors in R^{d1}, M_r ∈ R^{d2×d2} is a mapping matrix, and D is a diagonal matrix that is independent of the input triple. Depending on whether the parameters of the 2-way and 3-way score terms are jointly updated (or fine-tuned) in a second phase, Tatec proposes two strategies to effectively combine the bigram and trigram scores: fine-tuning (Tatec-ft) and linear combination (Tatec-lc); the former simply adds the s1 and s2 terms and fine-tunes the overall parameters of s, while the latter combines the two linearly. Besides, Tatec applies hard regularization or soft regularization to the Tatec-ft optimization problem.
ANALOGY [40] is an extended version of RESCAL which explicitly models the analogical properties of both entity and relation embeddings. It applies the bilinear score function used in RESCAL (shown in formula (1)) but further stipulates that the relation mapping matrices must be normal as well as mutually commutative:
normality: M_r M_r^T = M_r^T M_r, ∀r ∈ R
commutativity: M_r M_{r′} = M_{r′} M_r, ∀r, r′ ∈ R.
The relation matrices can be simultaneously block-diagonalized into a set of sparse, almost-diagonal matrices, each decomposed matrix having O(d) free parameters. Besides, ANALOGY carries out the training process by formulating a differentiable learning objective, which allows it to exhibit favorable theoretical power and computational scalability. Evidence has shown that multiple TF methods, such as DistMult [42], HolE [58], and ComplEx [43] (mentioned later), can be regarded as special cases of ANALOGY in a principled manner.
REST [41] has a fast response speed and good adaptability to evolving data, and yet it obtains comparable or better performance than previous TF approaches. Based on the TF model, REST uses a Random Walk (RW)-based semantic graph sampling algorithm (SGS) and a predicate sparsification operator to construct ensemble components: it samples a large KG tensor in its graph representation to build diverse, smaller subgraph tensors (the ensemble architecture is shown in Fig. 7), and then uses them in conjunction for the focused link prediction (FLP) task. Experimental results show that FLP and SGS are helpful for reducing the search space and noise. In addition, the predicate sparsification can improve the prediction accuracy. REST can deliver results on demand, which makes it more suitable for dynamic and evolving KGC settings.
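The bilinear form of Eq. (1), shared by RESCAL and ANALOGY, is easy to state in code; the following sketch (our own, with toy dimensions and random stand-ins for learned parameters) scores a triple with a per-relation matrix:

import numpy as np

d = 4
rng = np.random.default_rng(1)
E = rng.normal(size=(10, d))          # entity embeddings v_e
M = rng.normal(size=(5, d, d))        # one (generally asymmetric) matrix M_r per relation

def rescal_score(h, r, t):
    """s(h, r, t) = v_h^T M_r v_t, as in Eq. (1)."""
    return E[h] @ M[r] @ E[t]

print(rescal_score(0, 2, 3))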

3.1.1.3. CANDECOMP/PARAFAC (CP)-based TF methods. The most well-known canonical tensor decomposition method relevant to the KGC field might be CANDECOMP/PARAFAC (CP) [59], in which a tensor χ ∈ R^{N1×N2×N3} is represented as a sum of R rank-one tensors x_r^{(1)} ⊗ x_r^{(2)} ⊗ x_r^{(3)}:
χ = Σ_{r=1}^{R} x_r^{(1)} ⊗ x_r^{(2)} ⊗ x_r^{(3)},
where ⊗ denotes the tensor (outer) product, r ∈ {1, ..., R}, and x_r^{(i)} ∈ R^{Ni}. Fig. 8 shows the representation of the CP decomposition and the scoring of a given triple. In particular, the smallest R for which such a decomposition of χ exists is called the canonical rank of χ. Although the current implementations of CP on standard KGC benchmarks are known to perform poorly compared to more specialized methods, CP has surprising expressive ability, and thus a series of works attempt to understand the limits of CP and further extend it for KGC.

Fig. 7. Ensemble Architecture in REST [41].

Fig. 8. The tensor operated via CP and the score of a triple (h, r, t) [50].

DistMult [42] replaces the dense matrix in RESCAL [13] with diagonal matrices to significantly reduce the number of parameters of RESCAL; the score function is defined as follows:
s(h, r, t) = v_h^T M_{r,diag} v_t,    (4)
where M_{r,diag} = diag(v_r). However, because DistMult represents embedding vectors with real values, it can only model symmetric relations, due to the symmetric nature of the product operator on real numbers.
Moreover, Holographic Embeddings (HolE) [58] is not a CP-based method, but it can model asymmetric relations as well as RESCAL does, and it achieves the same simplicity as DistMult by only performing an efficient circular correlation. With the commutative circular correlation of tensors, HolE can generate compositional fixed-width representations, i.e., it allows R^d × R^d → R^d, which significantly reduces the number of parameters while retaining high scalability, economical computation, and easy training.
ComplEx [43] first uses complex-valued embeddings in a complex space to handle both symmetric and antisymmetric binary relations, where each embedding is represented using two vectors (real and imaginary parts). In addition, ComplEx represents tail entities by their complex conjugates, so that it can encode both symmetric and asymmetric relations. What is more, as shown in formula (5), ComplEx's bilinear energy function consists of various interaction parts, unlike DistMult with only one bilinear product component:
s(h, r, t) = Re(v_h^T M_{r,diag} v̄_t) = Re( Σ_{i=0}^{d−1} [v_r]_i · [v_h]_i · [v̄_t]_i ),    (5)
where v_h, v_t, v_r ∈ C^d, Re() is the operator that takes the real part of a complex value, [v_x]_i represents the ith element of v_x, and v̄_t denotes the conjugate of v_t. The special structure of each component, combining real and imaginary numbers, explains the ability of ComplEx to model antisymmetric relations in KGs. Note that HolE has been shown to be a special case of ComplEx in which conjugate symmetry is imposed on the embeddings.
SimplE [44]: Inspired by Canonical Polyadic (CP) [59], SimplE improves it by utilizing inverse relations to handle the poor performance of CP caused by the independence of the two entity vectors. SimplE considers two vectors v_r, v_{r^{-1}} for each relation r; the similarity score of SimplE for a triple (e_i, r, e_j), e_i, e_j ∈ E, is defined as the average of the CP scores for (e_i, r, e_j) and (e_j, r^{-1}, e_i). This setup allows the two embeddings of each entity to be learned dependently and makes SimplE a bilinear model, which scores each triple as
s(h, r, t) = 1/2 (s_CP(h, r, t) + s_CP(t, r^{-1}, h)),
s_CP(h, r, t) = Σ_{i=1}^{d} [v_h]_i · [v_r]_i · [v_t]_i.
SimplE also uses a log-likelihood loss to avoid over-fitting. The SimplE model is not only fully expressive, it also performs very well empirically despite (or maybe because of) its simplicity.
DrWT [45] aims at fine-grained type inference in KGs. It explicitly models domain knowledge and leverages additional data outside the KG, namely the anchor-linked Wikipedia page documents of entities and extra relations mapped from additional data sources. DrWT uses a CP-based 4th-order tensor factorization, which factorizes each 4th-order domain-relevance weighted tensor χ ∈ R^{S×O×P×D} as
s(E, F, G, H) = χ = C_diag ×_S E ×_P F ×_O G ×_D H,
where the diagonal core tensor C ∈ R^{d×d×d×d} and the feature matrices E ∈ R^{S×d}, F ∈ R^{O×d}, G ∈ R^{P×d}, and H ∈ R^{D×d} are the model parameters to be learned, and ×_x denotes the tensor product multiplying a matrix along dimension x. DrWT is an attempt to explicitly leverage domain knowledge in KGs and to utilize the large amount of additional interactions among multiple entities and text descriptions. It further discusses probabilistic inference based on collective multilevel type classification and the latent similarity of typed entities.
TriVec [46] is an efficient novel TF-based KG embedding model for standard benchmark datasets and/or more challenging datasets from practical application scenarios. TriVec improves ComplEx by replacing its four-part embedding score function with a three-component form and representing each entity and relation with three parts, which enables TriVec to deal with both symmetric and asymmetric relations. Moreover, TriVec adopts a combined loss function for training, which applies the traditional ranking loss with the squared error and the logistic loss, together with a multi-class configuration with a negative log-softmax loss. TriVec also prepares a new benchmark dataset, NELL239, and, in particular, produces a real biological application dataset based on the Comparative Toxicogenomics Database (CTD), which aims at assessing the practical applicability of TriVec.
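The diagonal-bilinear scores in Eqs. (4) and (5) can be illustrated as follows (our own sketch; the complex embeddings below are random stand-ins for trained parameters):

import numpy as np

d = 4
rng = np.random.default_rng(2)
E = rng.normal(size=(10, d)) + 1j * rng.normal(size=(10, d))   # complex entity embeddings
R = rng.normal(size=(5, d)) + 1j * rng.normal(size=(5, d))     # complex relation embeddings

def complex_score(h, r, t):
    """Eq. (5): Re(sum_i [v_r]_i [v_h]_i conj([v_t]_i)); with purely real
    embeddings this reduces to the DistMult score of Eq. (4)."""
    return np.real(np.sum(R[r] * E[h] * np.conj(E[t])))

print(complex_score(0, 1, 2))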

3.1.1.4. Additional training technologies. The scalable and efficient performance of these bilinear models has encouraged lots of studies to investigate boosting the DistMult and the ComplEx models by exploiting different training objectives and regularization constraints [47,50].

Ensemble DistMult [47] reproduces DistMult [42] through simple parameter adjustment and fine-tuning technology, and gets better scores than most previous KGC methods. Ensemble DistMult employs softmax normalization and imposes it on the original score function of DistMult, which turns the formula (4) into:

s(h, r, t) = vh^T · Mr^diag · vt

s′(h, r, t) = P(t|h, r) = exp(s(h, r, t)) / Σ_{t̄∈ϵ_{h,r}} exp(s(h, r, t̄))

where ϵ_{h,r} is the set of candidate answer entities for the (h, r, ?) query.

Ensemble DistMult concludes that increasing the number of negative instances has a positive impact on the results, and that the batch size also matters: a larger iteration batch size can promote the model effect. It highlights the open question of whether a model's gains are achieved through theoretically better algorithms or merely through a more extensive parametric search. In addition, since the filtered scenario assumes that there is only one correct answer among the candidates in the KG, which is unrealistic, Ensemble DistMult puts forward the proposal that more attention should be paid to the original raw scenario rather than the filtered setting; however, this requires the use of other information retrieval metrics, such as Mean Average Precision (MAP).
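To make the normalization step concrete, the following minimal sketch (our own illustration, not the code of [47]) shows how a DistMult score can be turned into a softmax probability over a set of candidate tail entities; the shapes and variable names are assumptions.

import numpy as np

def distmult_scores(v_h, v_r, E_tails):
    # s(h, r, t) = v_h^T diag(v_r) v_t, computed for every candidate tail at once
    return (v_h * v_r) @ E_tails.T            # shape: (num_candidates,)

def softmax_over_candidates(scores):
    # P(t | h, r) = exp(s) / sum_t' exp(s'), as in the normalized score s'
    z = scores - scores.max()                 # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

d, num_candidates = 8, 5
rng = np.random.default_rng(1)
v_h, v_r = rng.normal(size=d), rng.normal(size=d)
E_tails = rng.normal(size=(num_candidates, d))    # candidate tail embeddings
print(softmax_over_candidates(distmult_scores(v_h, v_r, E_tails)))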
Regularizer-Enhanced Model [48] also aims to improve ComplEx by designing a novel L1 regularizer R1, called the multiplicative L1 regularizer, which can support modeling both symmetric and antisymmetric relations. The regularizer R1 takes the form of an L1-norm penalty to allow the sparsity of pairwise products. More specifically, this L1 penalty term is expected to help guide learning a vector for relation r in accordance with whether r is symmetric, antisymmetric, or neither of them, as observed in the training data, because the real and imaginary parts of a relation vector govern the symmetry/antisymmetry of the scoring function for the relation. Since parameters are coupled componentwise, the proposed model can also deal with non-symmetric, non-antisymmetric relations which have varying degrees of symmetry/antisymmetry. Letting Θ denote the vector of all model parameters, the regularizer terms are as follows:

R1(Θ) = Σ_{r∈R} Σ_{i=0}^{d−1} |Re([vr]_i) · Im([vr]_i)|,  vr ∈ C^d

R2(Θ) = ∥Θ∥^2

Although the non-convex R1 term makes the optimization harder, experiments report that multiplicative L1 regularization not only outperforms the previous standard one in KGC, but is also robust against random initialization.
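For illustration only (not the implementation released with [48]), the two regularizer terms above can be written in a few lines; the array names below are our own assumptions.

import numpy as np

def multiplicative_l1(re_r, im_r):
    # R1(Θ) = Σ_r Σ_i |Re([v_r]_i) · Im([v_r]_i)|
    # re_r, im_r: arrays of shape (num_relations, d) with the real and
    # imaginary parts of every relation embedding.
    return float(np.abs(re_r * im_r).sum())

def squared_l2(params):
    # R2(Θ) = ||Θ||^2, the usual squared L2 regularizer over all parameters
    return float(sum((p ** 2).sum() for p in params))

rng = np.random.default_rng(2)
re_r, im_r = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(multiplicative_l1(re_r, im_r), squared_l2([re_r, im_r]))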
Constraints-enhanced Model [49] imposes simple constraints on KGC: it introduces non-negativity constraints (NNE) on entity representations to form compact and interpretable representations for entities, and approximate entailment constraints (AER) on relation representations for further encoding regularities of logical entailment between relations into their distributed representations. These two constraints are:

NNE : 0 ≤ Re(ve), Im(ve) ≤ 1, ∀e ∈ E, ve ∈ C^d    (6)

AER : s(ei, rp, ej) ≤ s(ei, rq, ej), ∀ei, ej ∈ E    (7)

In the formula (6), non-negativity constraints are imposed on both the real part and the imaginary part of the entity vector, which states that only positive properties will be stored in entity representations. Note that 0 and 1 are the all-zeros and all-ones d-dimensional vectors, and ≥, ≤, = denote entry-wise comparisons. The formula (7) formally describes that, when there is a strict entailment rp → rq, the triple scores must meet the requirement that if (ei, rp, ej) is a true fact with a high score s(ei, rp, ej), then the triple (ei, rq, ej), with an even higher score, should also be predicted as a true fact.

As Lee and Seung [60] pointed out, non-negativity will, in most cases, further induce sparsity and interpretability. Besides improving the KG embedding, the proposed simple constraints impose prior beliefs upon the embedding space structure but do not significantly increase the space or time complexity.

Weighted Nuclear 3-Norm Regularizer Model (N3) [50] also improves the basic CP model by testing a novel regularizer based on tensor nuclear p-norms. It first indicated that the regularizer based on the squared Frobenius norms of the factors [42,43] mostly used in the past is not a tensor norm since it is un-weighted. This paper then introduces a variational form of the nuclear 3-norm to replace the usual regularization at no additional computational cost, with the form:

Ωp^α(v) = 1/3 Σ_{r=1}^{R} Σ_{d=1}^{3} ∥vr^(d)∥α

where p = 3 when it is a nuclear 3-norm, and vr^(d), d = 1, 2, 3 denote the factors of the subject mode, the predicate mode and the object mode respectively. Finally, Lacroix et al. [50] discuss a weighting scheme analogous to the weighted trace-norm proposed in Srebro and Salakhutdinov [61] as:

Weighted(Ωp^α(v)) = 1/3 Σ_{r=1}^{R} Σ_{d=1}^{3} ∥(q^(d))^{1/p} ⊙ vr^(d)∥α

where q^(d) represents the weighting implied by this regularization scheme. Surprisingly, with the use of the nuclear p-norms [62] and the Reciprocal setting, this tensor regularizer recreates a much more successful result for CP decomposition (even better than the advanced ComplEx), and this reflects the phenomenon that, although the effect of optimization parameters is well known, neither the effect of the formulation nor the effect of regularization has been properly evaluated or utilized. This work suggests one possibility: when each model is evaluated under an appropriate optimal configuration, its performance may make great progress. This observation is very important for assessing and determining the direction of further TF study for KGC.
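The following minimal sketch (an illustration under our own naming and the elementwise p-power reading of the norm, not the code released with [50]) shows how one weighted nuclear 3-norm term could be computed for the three factor vectors of a single training triple.

import numpy as np

def weighted_n3(subj, pred, obj, q_s=1.0, q_p=1.0, q_o=1.0, p=3):
    # One term of the (weighted) nuclear 3-norm regularizer:
    # (1/3) * sum over the three modes of || q^(1/p) * v ||_p^p
    total = 0.0
    for v, q in ((subj, q_s), (pred, q_p), (obj, q_o)):
        total += np.sum(np.abs((q ** (1.0 / p)) * v) ** p)
    return total / 3.0

rng = np.random.default_rng(3)
s, r, o = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
print(weighted_n3(s, r, o))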

Binarized Canonical Polyadic Decomposition (B-CP) [51] extends the CP model by replacing the original real-valued parameters with binary values. Conducting only bitwise operations for score computation, B-CP has been proven a successful technique that obtains more than an order of magnitude of compression while maintaining the same task performance as the real-valued CP model. Specifically, letting D be the number of rank-one tensors and △ be a positive constant value, B-CP binarizes the original factor matrices A, B ∈ R^(Ne×d) and C ∈ R^(Nr×d) in CP with A^(b), B^(b) ∈ {+△, −△}^(Ne×d), C^(b) ∈ {+△, −△}^(Nr×d), thus the original boolean tensor χhrt ∈ {0, 1}^(Ne×Ne×Nr) is turned into:

s(h, r, t) = χhrt = Σ_{i∈[d]} vhi^(b) · vti^(b) · vri^(b)

where vhi^(b) = Q△(vhi), vti^(b) = Q△(vti), vri^(b) = Q△(vri) are binarized through:

Q△(x) = △sign(x) = { +△ if x ≥ 0, −△ if x < 0 }

Here the binarization function can be further extended to vectors: Q△(x) means a vector whose ith element is Q△(xi). By deriving a bound on the size of its embeddings, B-CP is proved to be fully expressive.

QuatE [52] is an extension of ComplEx to the hyper-complex space. QuatE creatively introduced hyper-complex representations to learn KG embeddings more expressively: it uses quaternion embeddings, hyper-complex-valued embeddings with three imaginary components, to model entities and considers rotations in the quaternion space to represent relations. In QuatE, the entity embeddings are represented by a quaternion matrix Q ∈ H^(Ne×k), and the relation embeddings are denoted by W ∈ H^(Nr×k), where k is the dimension of the embedding. For a triplet (h, r, t), denote Qh = {ah + bh i + ch j + dh k : ah, bh, ch, dh ∈ R^k} and Qt = {at + bt i + ct j + dt k : at, bt, ct, dt ∈ R^k} as the head entity h and tail entity t respectively, while the relation r is expressed as Wr = {ar + br i + cr j + dr k : ar, br, cr, dr ∈ R^k} (in a quaternion Q = a + bi + cj + dk, the a, b, c, d are real numbers and i, j, k are imaginary units that are square roots of −1). Then the scoring function uses the quaternion inner product:

s(h, r, t) = Qh′ · Qt

The Qh′ means the head entity rotation conducted with the Hamilton product:

Qh′ = Qh ⊗ Wr◁

and Wr◁ = p + qi + uj + vk is a unit quaternion obtained by normalizing the relation quaternion Wr, which is calculated by:

Wr◁(p, q, u, v) = Wr / |Wr| = (ar + br i + cr j + dr k) / √(ar² + br² + cr² + dr²)

Compared to the complex Hermitian operator and the inner product in Euclidean space, the Hamilton operator provides a greater extent of expressiveness: it can aptly capture latent inter-dependencies (between all components) and supports a more compact interaction between entities and relations. It is also worth mentioning that rotation over four-dimensional space has more degrees of freedom than complex plane rotation. Since QuatE is a generalization of ComplEx on hyper-complex space but offers better geometrical interpretations, it also satisfies the key requirement of learning symmetry, anti-symmetry, and inversion relations. Note that when the coefficients of the imaginary units j and k are all set to zero, the obtained complex embeddings will be the same as in ComplEx, yet the Hamilton product will also degrade to complex number multiplication, while it even obtains the DistMult case when the normalization of the relational quaternion is further removed.
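For intuition, here is a small sketch of quaternion normalization and the Hamilton product used for the head-entity rotation. It is an illustrative re-implementation under our own naming (and, as a simplification, uses a single quaternion per object rather than k quaternion components); it is not the official QuatE code.

import numpy as np

def normalize_quaternion(q):
    # q = (a, b, c, d) -> unit quaternion W_r◁ = W_r / |W_r|
    return q / np.linalg.norm(q)

def hamilton_product(q1, q2):
    # Hamilton product of two quaternions written as (a + bi + cj + dk)
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return np.array([
        a1*a2 - b1*b2 - c1*c2 - d1*d2,
        a1*b2 + b1*a2 + c1*d2 - d1*c2,
        a1*c2 - b1*d2 + c1*a2 + d1*b2,
        a1*d2 + b1*c2 - c1*b2 + d1*a2,
    ])

def quate_score(q_h, w_r, q_t):
    # s(h, r, t) = (Q_h ⊗ W_r◁) · Q_t : rotate the head, then take the inner product
    return float(np.dot(hamilton_product(q_h, normalize_quaternion(w_r)), q_t))

rng = np.random.default_rng(4)
print(quate_score(rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)))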
JoBi [53] designs an auxiliary model using entity-relation co-occurrence pairs for joint learning with the base model (which can be any bilinear KGE model). The occurrences of entity-relation pairs help overcome data sparsity well, and also bias the model to score plausible triples higher. JoBi creatively contains two copies of a bilinear model: the base triple model is trained on the triples' labels, while the pair model is trained on occurrences of entity-relation pairs within the triples. For the triple (h, r, t), the scoring functions sbi and stri for the pair and triple models respectively are shown as:

sbi(h, r, t) = Re(vh^T diag(vr^bi) v̄t)

stri(h, r, t) = Re(vh^T diag(vr^tri) v̄t)

where v̄t denotes the complex conjugate of vt, and Re(x) denotes the real part of the complex vector x. The two modules are jointly optimized during training, but at test time only the stri model is used, which is the reason why the additional auxiliary module does not affect the number of final parameters of the trained model. JoBi also utilizes entity-relation pair occurrences to improve the distribution of negative examples for contrastive training, which allows the model to learn higher quality embeddings with much fewer negative samples. Finally, a negative log-likelihood loss of softmax and a binary cross-entropy loss are used for stri and sbi respectively; the two losses are then combined via a simple weighted addition with a tunable hyper-parameter α:

L = Ltri + α Lbi

Linear & Quadratic Model [54] presents a group of novel methods for embedding KGs into real-valued tensors, including four modules, ‘Linear + Regularized’, ‘Quadratic + Regularized’, ‘Quadratic + Constraint’ and ‘Linear + Constraint’, where two of the models optimize a linear factorization objective and two a quadratic one. All in all, it reconstructs each of the k relation slices of the order-3 tensor χ as:

χk ≈ Aα Rk Aβ^T    (8)

where A is the collection of p-dimensional entity embeddings and R is the collection of relation embeddings. The matrices Aα and Aβ are elements contained in A which satisfy Aα, Aβ ∈ R^(Ne×d), with d the dimension of both entity and relation embeddings. The whole augmented reconstruction loss objective to be minimized is formed as:

L = min_{A,R} f(A, R) + g(A, R) + fs(A, R, C) + fρ(A, R, C) + fLag(A, R, C)    (9)

where f(A, R) means the reconstruction loss reflecting each of the k relational criteria in the formula (8), the g(A, R) term represents the standard numerical regularization of the embeddings, and fs(A, R, C) uses the similarity matrix C proposed in this work to conduct knowledge-directed enrichment with extra knowledge. Additionally, the two terms fρ(A, R, C) + fLag(A, R, C) in the formula (9) respectively reflect the added knowledge-directed enrichment items about new regularizers and constraints. This work can easily use prior background knowledge provided by users or extracted automatically from existing KGs, providing more robust and provably convergent linear TF methods for KG embedding.

3.1.1.5. Performance analysis about TF models. We integrate experimental results on the WN18 and FB15K datasets from most of the above-mentioned models (as shown in Table 5). Fig. 9 shows the performance of TF models on the WN18RR and FB15K-237 datasets for further analysis.

a. Preliminary Performance Analysis

From Table 5, we can see that the improved extension methods based on the original linear tensor decomposition models (such as ComplEx) have achieved highly competitive MRR, Hits@10 and accuracy results, which can be summarized as follows:

(1) Regularization analysis: In WN18, Manabe et al. [48] and Lacroix et al. [50] try to use different regularization techniques to improve the traditional TF method, and they both obtain satisfying performance. In [48], the proposed multiplicative L1 regularizer (‘ComplEx w/mul L1’) shows strong competitiveness and even exceeds the previous baselines. The method in [50] performs well because it applies the nuclear 3-norm regularizer (‘ComplEx-N3-R’). Additionally, the work [50] resets multi-class log-loss and
Table 5
Statistics of experimental results of TF models on WN18 and FB15K. Bold and italic mark the scores ranking first and second under the same metric, respectively.
WN18 FB15K
MRR Hits@1 Hits@3 Hits@10 MR MRR Hits@1 Hits@3 Hits@10 MR
RESCAL [13] 0.89 0.842 0.904 0.928 – 0.354 0.235 0.409 0.587 –
DistMult [42] 0.83 – – 0.942 – 0.35 – – 0.577 –
Single DistMult [47] 0.797 – – 0.946 655 0.798 – – 0.893 42.2
Ensemble DistMult [47] 0.79 – – 0.95 457 0.837 – – 0.904 35.9
ComplEx [43] 0.941 0.936 0.945 0.947 – 0.692 0.599 0.759 0.84 –
ComplEx w/std L1 [48] 0.943 0.94 0.945 0.948 – 0.711 0.618 0.783 0.856 –
ComplEx w/mul L1 [48] 0.943 0.94 0.946 0.949 – 0.733 0.643 0.803 0.868 –
ComplEx-NNEc [49] 0.941 0.937 0.944 0.948 – 0.727 0.659 0.772 0.845 –
ComplEx-NNE+AERc [49] 0.943 0.94 0.945 0.948 – 0.803 0.761 0.831 0.874 –
ANALOGY [48] 0.942 0.939 0.944 0.947 – 0.725 0.646 0.785 0.854 –
RESCAL + TransE [31] 0.873 – – 0.948 510 0.511 – – 0.797 61
RESCAL + HolE [31] 0.94 – – 0.944 743 0.575 – – 0.791 165
HolE + TransE [31] 0.938 – – 0.949 507 0.61 – – 0.846 67
RESCAL + HolE + TransE [31] 0.94 – – 0.95 507 0.628 – – 0.851 52
SimplE [44] 0.942 0.939 0.944 0.947 – 0.727 0.66 0.773 0.838 –
ComplEx-N3-Sa [50] 0.95 – – 0.96 – 0.8 – – 0.89 –
CP [51] 0.942 0.939 0.945 0.947 – 0.72 0.659 0.768 0.829 –
CP-FRO-Rb [50] 0.95 – – 0.95 – 0.86 – – 0.91 –
CP-N3-Rb [50] 0.95 – – 0.96 – 0.86 – – 0.91 –
ComplEx-FRO-Rb [50] 0.95 – – 0.96 – 0.86 – – 0.91 –
ComplEx-N3-Rb [50] 0.95 – – 0.96 – 0.86 – – 0.91 –
B-DistMult [51] 0.841 0.761 0.915 0.944 – 0.672 0.558 0.76 0.854 –
B-CP [51] 0.945 0.941 0.948 0.956 – 0.733 0.66 0.793 0.87 –
QuatE [52] 0.949 0.941 0.954 0.96 388 0.77 0.7 0.821 0.878 41
QuatE-N3-R [52] 0.95 0.944 0.954 0.962 – 0.833 0.8 0.859 0.9 –
QuatE+TYPEc [52] 0.95 0.945 0.954 0.959 162 0.782 0.711 0.835 0.9 17
TuckER [37] 0.953 0.949 0.955 0.958 – 0.795 0.741 0.833 0.892 –
a ‘‘S’’ means the Standard learning.
b ‘‘R’’ denotes the Reciprocal learning.
c ‘‘NNE’’, ‘‘AER’’ and ‘‘TYPE’’ denote the non-negativity constraints, approximate entailment constraints [49] and the type constraints [52], respectively.

selects a larger rank scope for a more extensive search over the optimization/regularization parameters, which are also reasons for its good performance. Approaches that apply the nuclear 3-norm regularizer still show extraordinary talents on FB15K, and most of the improvements are statistically more significant than those on WN18.

(2) Constraints on entities and relations: The results of ‘ComplEx-NNE+AER’ [49] demonstrate that imposing the non-negativity and approximate entailment constraints respectively for entities and relations indeed improves KG embedding. In Table 5, ‘ComplEx-NNE’ and ‘ComplEx-NNE+AER’ perform better than (or equally well as) ComplEx in WN18. We can make an interesting observation: by introducing these simple constraints, ‘ComplEx-NNE+AER’ can beat strong baselines, including the best performing basic models like ANALOGY and those previous extensions of ComplEx, while such axioms can be derived directly from the approximate entailments in [49]. Exerting proper constraints on the original linear TF models is also very helpful for KGC; just as in WN18, the constraint-based ‘ComplEx-NNE+AER’ also outperforms ComplEx and other traditional TF models.

(3) Different dimension space modeling: In addition, the explorations of new tensor decomposition modes in different dimension spaces also achieve inspiring success. From Table 5 we can observe that on WN18, the quaternion-valued method QuatE performs competitively compared to the existing state-of-the-art models across all metrics and deservedly outperforms the representative complex-valued basic model ComplEx, because quaternion rotation has advantages over rotation in the complex plane for modeling complex relations. Besides, the N3 regularization and reciprocal learning in QuatE, or the type constraints in QuatE, also play an important role in QuatE's success. Another eye-catching method, TuckER, takes account of the binary tensor representation of KGs, and outperforms almost all linear TF models along with their relevant extension versions on all metrics in WN18. TuckER consistently obtains better results than those lightweight models ComplEx and SimplE that are famous for simplicity and fewer parameters, which is because TuckER allows knowledge sharing between relations through the core tensor so that it supports multi-task learning. In comparison, ‘ComplEx-N3’ [50], which also benefits from multi-task learning, forces parameter sharing between relations by rank regularization of the embedding matrices to encourage a low-rank factorization; it uses the highly non-standard setting de = dr = 2000 to generate a large number of parameters compared with TuckER, resulting in slightly lower scores than TuckER. Additionally, both QuatE and TuckER also achieve remarkable results on FB15K; especially QuatE, which on Hits@1 outperforms state-of-the-art models while the second-best results scatter amongst TuckER and ‘ComplEx-NNE+AER’. Unlike the constraint-based methods that aim to apply prior beliefs to shrink the solution space, QuatE achieves high scores by effectively capturing the symmetry, antisymmetry, and inversion relation patterns, which take a large portion in both WN18 and FB15K. On FB15K, TuckER obtains lackluster performance across the MRR and Hits@10 metrics but excels on the toughest Hits@1 metric.

(4) Considering hyper-parameters setting: It is notable that on FB15K, the Ensemble DistMult also obtains high results across both MRR and Hits@10; this is because it further improves DistMult only with proper hyper-parameter settings. This work helps us to resolve the doubt: whether an algorithm succeeded due to a better model/algorithm or just by a more extensive hyper-parameter search. On the other hand, the good results of DistMult reported in Ensemble DistMult are also due to using a large negative sampling size (i.e., 1000, 2000).

b. Further Performance Verification:

We have analyzed the effects of many factors on performance, especially the effectiveness of constraints or regularization techniques. To further evaluate the efficacy, we select the experimental results evaluated on WN18RR and FB15K-237 for illustration.
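For readers unfamiliar with the evaluation metrics reported in these tables and figures, the following short sketch computes MRR and Hits@k from a list of hypothetical ranks of the correct entities; the example values are ours and purely illustrative.

import numpy as np

def mrr(ranks):
    # Mean Reciprocal Rank over the ranks of the true entities
    ranks = np.asarray(ranks, dtype=float)
    return float((1.0 / ranks).mean())

def hits_at(ranks, k=10):
    # Fraction of test triples whose true entity is ranked within the top k
    ranks = np.asarray(ranks)
    return float((ranks <= k).mean())

example_ranks = [1, 3, 12, 2, 250, 7]    # hypothetical ranks of correct entities
print(mrr(example_ranks), hits_at(example_ranks, k=10))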
Fig. 9. MRR, Hits@10 of TF methods on WN18RR and FB15K-237. ‘‘*’’, ‘‘**’’, ‘‘#’’ and ‘‘⋇’’ respectively indicate results from [37,50–52]. ‘‘S’’, ‘‘R’’ are Standard learning and Reciprocal learning, respectively.

We plot the experimental data on the WN18RR and FB15K-237 datasets as Fig. 9, from which we can easily discover that both ‘ComplEx-N3’ and QuatE perform excellently in all metrics; this observation demonstrates that the two models own great generality and scalability. Besides, the success of QuatE also enlightens us to explore the potential cooperation mode of useful techniques, such as N3 regularization, reciprocal learning, non-negativity constraints (NNE), and approximate entailment constraints (AER).

3.1.1.6. Discussion about TF models. Based on the above detailed introduction and a series of comparisons and analyses on the experimental results of these mentioned Tensor Factorization (TF) models, we further make some conclusive discussions:

1. Regularization and Constraints. Generally speaking, either imposing proper regularization or constraints on linear tensor factorization models is beneficial for KGC.

2. High-dimensional Spaces Modeling. Using rotation or other operations to model entities and relationships in high-dimensional spaces (such as QuatE and TuckER) with higher degrees of freedom may be a good attempt and a nice choice for further exploration on KGC.

3. Multi-task Learning. TuckER not only achieves better results than those of other linear models but also better results than many complex algorithms belonging to other categories, such as deep neural network models and reinforcement-learning-based architectures, e.g. ConvE [33] and MINERVA [63]. Still, given the good achievements of TuckER along with ‘ComplEx-N3’, we can deduce that although they are different in specific method details, they all enjoy the great benefit of multi-task learning. We also conclude from TuckER that the simple linear models have valuable expressive power and are still worth serving as a baseline before moving on to more elaborate models. Overall, we can see that the linear TF methods still have potential to be further improved by appropriate constraints, regularization, and parameter settings.

4. Potential Threats. However, when exploring new improved methods, we should pay attention to the potential threats. For example, N3 normalization will require larger embedding dimensions, and the number of TuckER parameters will increase linearly with the number of entities or relations in KGs, so the scalability and economy of the algorithm should be considered.

3.1.2. Neural network models

We will give a detailed introduction to Neural Network models in KGC study. A summary table exhibiting the general features of the introduced neural network KGC methods can be found in Table 6.

In recent years, distributed representations that map discrete language units into continuous vector space have gained significant popularity along with the development of neural networks [64,73–75]. However, human-like reasoning remains an extremely challenging problem, partially because it requires the effective encoding of world knowledge using powerful models [64]. At the same time, it has been found that neural networks can intelligently capture the semantic features of entities and relations and reasonably model the semantic relationships between discrete entities, which can help learn more accurate embeddings of KGs. Meanwhile, more and more complex and effective deep neural network structures have been developed, leading to a large amount of studies that apply these novel neural network frameworks to the KGC field and obtain successful KGC results. We call this category of KGC approaches Neural Network-based KGC Models in our summary; it can also be referred to as the non-linear models in other literature because of the nonlinear functions in neural network structures, e.g., the softmax function, sigmoid activation function, etc.

3.1.2.1. Traditional neural network-based KGC models. Neural Tensor Networks (NTN) [14] The primitive NTN averages the word vectors in an entity name to generate the entity vector, so that entities with similar names can share text information. NTN can explicitly reason relations between two entity vectors in KGs. In NTN, the standard linear neural network layer is replaced by a bilinear tensor layer, which is used to directly associate two entity vectors in multiple dimensions and calculate a score representing the possibility of two entities vh, vt having a certain relation r:

g(h, r, t) = u^T f(vh^T Wr^[1:k] vt + Vr[vh, vt]^T + br)

where f = tanh() is a standard nonlinearity activation function, and Wr^[1:k] ∈ R^(d×d×k), Vr ∈ R^(k×2d), br ∈ R^k are parameter tensors; the bilinear tensor product vh^T Wr^[1:k] vt results in a vector ve ∈ R^k.

Multi Layer Perceptron (MLP) [9] is a simplified version of NTN; it serves a multi-source Web-scale probabilistic knowledge base, Knowledge Vault, built by [9], which is much bigger than other existing automatically constructed KGs. To extract reliable facts from the Web, MLP replaces the NTN's interaction function with a standard multi-layer perceptron.

Neural Association Model (NAM) [64] possesses multi-layer nonlinear activations in its deep neural nets; the objective of this special framework is detecting association conditional probabilities among any two possible facts. NAM can be applied to several probabilistic reasoning tasks such as triple classification,
Table 6
Summarization and comparison of recent popular Neural Network models for KGC.
Model Technique Score Function Loss functiona Notation Datasets
Traditional neural network models:
[1:k]
NTN [14] Bilinear tensor layer g(h, r , t) = uT f (p1 + p2 + br ), max Lmarg Wr ∈ Rd×d×k , WordNet,
[1:k]
p1 = vhT Wr vt , p2 = Vr [vh , vt ]T Vr ∈ Rk×2d Freebase
f = tanh()
MLP [9] Improves NTN; s(h, r , t) = w T f (p1 + p2 + p3 ), – vh , vr , vt ∈ Rd , KV
Standard Multi-layer Perceptron p1 = M1 vh , p2 = M2 vr , p3 = M3 vt Mi ∈ Rd×d ,
w ∈ Rd ,
f = tanh()
NAM [64] Multi-layer nonlinear activations; s(h, r , t) = g(vtT u{L} ) Lll vh , vt , vr ∈ Rd , WN11,
probabilistic reasoning u{l} = f (W {l} u{l−1} + b{l} ) g = sigmoid(), FB13
u{0} = [vh , vr ] f = ReLU()
SENN [65] Embedding shared fully s(h, t) = vr ATR , s(r , t) = vh ATE , s(h, r) = vt ATE , Joint adaptively vh , vt , vr ∈ Rd , WN18,
connected neural network; vr = f (f (...f ([h; t ]Wr ,1 + br ,1 )...))Wr ,n + br ,n weighted loss AE ∈ R|E |×d , FB15K
adaptively weighted vh = f (f (...f ([r ; t ]Wh,1 + bh,1 )...))Wh,n + bh,n AR ∈ R|R|×d
loss mechanism vt = f (f (...f ([h; r ]Wt ,1 + bt ,1 )...))Wt ,n + bt ,n f = ReLU()
ParamE [66] MLP; CNN; gate structure; s(h, r , t) = ((fnn (vh ; vr ))W + b)vt LBCE vh , vt , vr ∈ Rd , FB15k-237,
embed relations as NN parameters vr = Paramfnn W ∈ Rd×n , b ∈ Rd , WN18RR
g = sigmoid(),
f = ReLU()
CNN-based KGC models:
ConvE [33] Multi-layer 2D CNN; s(h, r , t) = f (v ec(f (concat(vˆh , vˆr ) ∗ Ω ))W ) · vt b LBCE vh , vt ∈ Rd , WN18,
1-N scoring programs vˆh , vˆr ∈ Rdw ×dh ; FB15k,

vr ∈ Rd , YAGO3-10,
d = dw dh ; Countries,
f = ReLU(); FB15k-237
Ω : filter sets
InteractE Feature Permutation; s(h, r , t) = g(v ec(f (φ (Pk ) ◦ w ))W )vt c LBCE vh , vt , vr ∈ Rd , FB15K-237,
[67] Checkered Reshaping; Pi = [(vh1 , vr1 ); ...; (vhi , vri )] d = dw dh; WN18RR,
Circular Convolution f = ReLU(), YAGO3-10
g = sigmoid();
w: a filter
ConvKB [68] 1D CNN; s(h, r , t) = concat(g([vh , vr , vt ] ∗ Ω )) · W b Lnll vh , vt , vr ∈ Rd WN18RR,
Transitional characteristic; g = ReLU(), FB15k-237
L2 regularization Ω : filter sets;
CapsE [69] ConvKB; s(h, r , t) = ∥cap(g([vh , vr , vt ] ∗ Ω ))∥b Lnll vh , vr , vt ∈ Rd ; WN18RR,
capsules networks g = ReLU(); FB15k-237
Ω : filter sets;
cap() : Capsule-
Network
GCN-based KGC Models:
R-GCN [70] Basis decomposition; s(h, r , t) = vhT Wr vt LBCE vh , vt ∈ Rd ; WN18RR,
block-diagonal-decomposition; Wr ∈ Rd×d FB15k,
end-to-end framework: FB15k-237
encoder: R-GCN,
decoder: DistMult
SACN [71] End-to-end framework: s(h, r , t) = f (v ec(M(vh , vr ))W )vt – f = ReLU(); FB15k-237,
encoder: WGCN, W ∈ RCd×d ; WN18RR,
decoder: Conv-TransE M(vh , vr ) ∈ RC ×d ; FB15k-237
C : kernels number -Attr
COMPGCN Entity-relation- sConv E , sDistMult , etc. – – FB15k-237,
[72] composition operators; WN18RR
end-to-end framework:
encoder: COMPGCN,
decoder: ConvE, DistMult, etc.

(continued on next page)

recognizing textual entailment, and it especially responds well to commonsense reasoning.

Shared Embedding based Neural Network (SENN) [65] explicitly differentiates the prediction tasks of head-entities, relations, and tail-entities by using three respective substructures with fully-connected neural networks in an embedding sharing manner. Then the prediction-specific scores gained from the substructures are employed to estimate the possibility of predictions. An adaptively weighted loss mechanism enables SENN to be more efficient in dealing with diverse prediction tasks and various mapping styles during the training process.

ParamE [66] is an expressive and translational KGC model which regards neural network parameters as relation embeddings, while the head entity embeddings and tail entity embeddings are regarded as the input and output of this neural network respectively. To confirm whether ParamE is a general framework for different NN architectures, this paper designs three different NN architectures to implement ParamE: multi-layer perceptrons (MLP), convolution layers, and gate structure layers, called ParamE-MLP,
Table 6 (continued).
Model Technique Score Function Loss functiona Notation Datasets
GAN-based KGC Models:
KBGAN [28] Discriminator+generator; strans d Lmarg – FB15k-237,
negative sampling; WN18,
reinforcement learning WN18RR
IGAN [31] Discriminator+generator; strans or ssem d Lmarg – FB15K,
negative sampling; FB13,
reinforcement learning; WN11,
non-zero loss WN18
KSGAN [29] Discriminator + generator + ssem d Lmarg – FB15k-237,
selector; WN18,
negative sampling; WN18RR
reinforcement learning;
non-zero loss
a
Lll (Lnll ), Lmarg and LBCE are (negative) log likely-hood loss, margin-based ranking loss and binary cross entropy loss respectively.
b
‘∗’ means a convolution operator.
c
‘◦’ means depth-wise circular convolution.
d
‘strans ’ and ‘ssem ’ are respective the score function of translation models and semantic matching models.

Fig. 10. The summarized frameworks of several CNN-based KGC models.


Source: Figures are extracted from [33,67–69].

ParamE-CNN, ParamE-Gate. Significantly, ParamE embeds the entity and relation representations in feature space and parameter space respectively; this makes entities and relations be mapped into two different spaces as expected.

3.1.2.2. Convolutional Neural Network (CNN)-based KGC models. We summarize some CNN-based KGC methods and draw a related figure (Fig. 10) exhibiting their whole architectures, from which we can clearly know the learning procedure of these models.

ConvE [33] describes a multi-layer 2D convolutional network model for the LP task, which is the first attempt that uses 2D convolutions over graph embeddings to explore more valuable feature interactions. ConvE defines its score function by a convolution over 2D shaped embeddings as:

s(h, r, t) = f(vec(f([v̄h; v̄r] ∗ ω))W)vt

where the relation parameter is vr ∈ R^k, and v̄h and v̄r represent the 2D reshaping of vh and vr respectively, which conform to v̄h, v̄r ∈ R^(kw×kh) when vh, vr ∈ R^k, where k = kw·kh and kh, kw denote the height and width of the reshaped 2D matrix. The vec() means the vectorization operation, while f() indicates the basic nonlinear transformation function, rectified linear units, for faster training [76]. ConvE owns much fewer parameters but is significantly efficient when modeling high-scale KGs with high-degree nodes. This work also points out the test set leakage issue of the WN18 and FB15k datasets, performing a comparative experiment on their robust variants: WN18RR and FB15K-237.

InteractE [67] further advances ConvE by increasing the captured interactions to heighten LP's performance. InteractE chooses a novel input style in a multiple permutation manner and replaces the simple feature reshaping of ConvE with checkered reshaping. Additionally, its special circular convolution structure is performed in a depth-wise manner.
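To make the 2D-reshaping idea behind ConvE-style models concrete, here is a minimal PyTorch sketch of such a scoring pass; the shapes, filter count, and variable names are our own illustrative assumptions, not the reference implementation.

import torch
import torch.nn.functional as F

def conve_like_score(v_h, v_r, E_tails, kernel, W, kw=4, kh=4):
    # Reshape head and relation embeddings to 2D, stack them, convolve,
    # flatten, project back to the embedding size, then score every tail:
    # s(h, r, t) = f(vec(f([v̄h; v̄r] * ω)) W) · v_t
    x = torch.cat([v_h.view(1, kw, kh), v_r.view(1, kw, kh)], dim=1)  # (1, 2*kw, kh)
    x = x.unsqueeze(0)                          # (batch=1, channels=1, 2*kw, kh)
    x = F.relu(F.conv2d(x, kernel, padding=1))  # convolution feature maps
    x = F.relu(x.flatten(start_dim=1) @ W)      # project back to dimension d
    return x @ E_tails.T                        # one score per candidate tail

d, n_tails, n_filters = 16, 7, 3
v_h, v_r = torch.randn(d), torch.randn(d)
E_tails = torch.randn(n_tails, d)
kernel = torch.randn(n_filters, 1, 3, 3)        # n_filters 3x3 convolution filters
W = torch.randn(n_filters * 8 * 4, d)           # flattened conv output -> d
print(conve_like_score(v_h, v_r, E_tails, kernel, W))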
ConvKB [68] is proposed after ConvE; the main difference between ConvKB and ConvE is that ConvKB uses 1D convolution, expecting to extract global relations over the same dimensional entries of an input triple matrix, which indicates that ConvKB attends to the transitional characteristics of triples. According to the evaluation on two benchmark datasets, WN18RR and FB15k-237, ConvKB obtains better scores compared with ConvE and some other past models, which may be due to the efficient CNN structure as well as the design for extracting the global relation information, so that ConvKB will not ignore the transitional characteristics of triples in KGs.

CapsE After ConvKB, Nguyen et al. [69] next present CapsE to model triples by employing the capsule network [77], a network whose original intention is capturing entities in images. It is the first attempt at applying a capsule network to KGC. The general framework of CapsE is shown in Fig. 10(d), from which we can see that, after being fed to a convolution layer with multiple filter sets Ω as ConvKB does, the 3-column triple matrix is transformed into different feature maps, and these feature maps are later reconstructed by two capsule layers. A routing algorithm extended from Sabour et al. [77] guides the routing process between these two capsule layers. To that end, a continuous vector is produced whose length can be used to compute the score of the triple:

s(h, r, t) = ∥capsnet(g([vh, vr, vt] ∗ Ω))∥

where capsnet and ∗ mean the capsule network operator and the convolution operation respectively. Experimental results confirm that the CapsE model performs better than ConvKB [68] on WN18RR and FB15k-237.

3.1.2.3. Graph Convolution Network (GCN)-based KGC models. Graph Convolution Network (GCN) [78] was introduced as a generalization of Convolutional Neural Networks (CNNs),1 and it is a popular neural network architecture defined on a graph structure [70,80,82]. Recently, lots of researchers have employed GCNs to predict missing facts in KGs.

1 Whereas CNNs require regular structured data, such as images or sequences, GCNs allow for irregular graph-structured data [79]. GCNs can learn to extract features from the given node (entity) representations and then combine these features together to construct highly expressive entity vectors, which can further be used in a wide variety of graph-related tasks, such as graph classification [80] and generation [81].

R-GCN [70] is presented as an extension of GCNs that operates on local graph neighborhoods to accomplish KGC tasks. R-GCN uses relation-specific transformations, different from regular GCNs, on the encoder side. For the LP task, the DistMult model was chosen as the decoder to compute an edge's score. To avoid over-fitting on sparse relations and massive growth of model parameters, this work utilizes block-diagonal-decomposition methods to regularize the weights of R-GCN layers. R-GCN can act as a competitive, end-to-end trainable graph-based encoder (just as SACN [71] shows), i.e., in the LP task, the R-GCN model with DistMult factorization as the decoding component outperformed direct optimization of the factorization model and achieved competitive results on standard LP benchmarks.

Structure-Aware Convolutional Network (SACN) [71] is an end-to-end model, where the encoder uses a stack of multiple W-GCN (Weighted GCN) layers to learn information from both the graph structure and the graph nodes' attributes. The W-GCN framework addresses the over-parameterization shortcoming of GCNs by assigning a learnable relation-specific scalar weight to each relation and multiplying an incoming ‘‘message’’ by this weight during GCN aggregation. The decoder Conv-TransE is modified based on ConvE but abolishes the reshape process of ConvE, and simultaneously keeps the translational property among triples. In summary, the SACN framework efficiently combines the advantages of ConvE and GCN, thus obtaining a better performance than the original ConvE model when experimenting on the benchmark datasets FB15k-237 and WN18RR.

COMPGCN [72] Although R-GCN and W-GCN show performance gains on the KGC task, they are limited to embedding only the entities of the graph. On this basis, COMPGCN systematically leverages entity-relation composition operations from KGE techniques to jointly embed entities and relations in a graph. Firstly, COMPGCN alleviates the over-parameterization problem by performing KGE composition (φ(u, r)) of a neighboring node u with respect to its relation r, to substitute the original neighbor parameter vu in GCNs; therefore COMPGCN is relation-aware. Additionally, to ensure that COMPGCN scales with the increasing number of relations, COMPGCN shares relation embeddings across layers and uses basis decomposition based on the basis formulations proposed in R-GCN. Different from R-GCN, which defines a separate set of basis matrices for each GCN layer, COMPGCN defines basis vectors only for the first GCN layer, while the later layers share the relations through relation embedding transformations performed by a learnable transformation matrix. This makes COMPGCN more parameter efficient than R-GCN.

Recently, more and more novel effective GCN methods have been proposed to conduct graph analytical tasks. To efficiently exploit the structural properties of relational graphs, some recent works try to extend multi-layer GCNs to specific tasks for obtaining proper graph representations. For example, Bi-CLKT [83] and JKT [84], which are both knowledge tracing methods [85], apply a two-layer GCN structure to encode node-level and global-level representations for the relational subgraphs exercise-to-exercise (E2E) and concept-to-concept (C2C), respectively. The utilization of a two-layer GCN can effectively learn the original structural information from multidimensional relationship subgraphs. Besides, ie-HGCN [86] tries to learn interpretable and efficient task-specific object representations by using multiple layers of heterogeneous graph convolution on the Heterogeneous Information Network (HIN) [87]. Based on these works, a possible direction of future research is to explore multi-layer GCNs to efficiently capture different levels of structural information of KGs for the KGC task.

3.1.2.4. Generative adversarial network (GAN)-based KGC models. Generative adversarial network (GAN) [88] is one of the most promising methods for unsupervised learning on complex distributions in recent years, originally proposed for generating samples in a continuous space such as images. A GAN usually consists of at least two modules: a generative module and a discriminative module; the former accepts a noise input and outputs an image, while the latter is a classifier that classifies images as ‘‘true’’ (from the ground truth set) or ‘‘fake’’ (generated by the generator), and these two parts train and learn together in an adversarial way. However, it is not possible to use the original version of GANs for generating discrete samples like natural language sentences or knowledge graph triples, since gradients propagated back to the generator are blocked by the discrete sampling step [28], until SEQGAN [89] first gave successful solutions to this problem by using reinforcement learning: it trains the generator using policy gradient and other tricks. Likewise, many KGC works have arisen that incorporate the GAN framework in knowledge representation learning. Table 7 shows the general information about the GAN-based negative sampling methods. Intuitively, we place Fig. 11 to reveal the frame structure of GAN-based models.

KBGAN [28] aims to employ adversarial learning to generate high-quality negative training samples and replace the formerly used uniform sampling to improve Knowledge Graph Embedding (KG embedding). As Fig. 11(a) shows, KBGAN takes KG embedding models that are probability-based and have a log-loss function as the generator to supply better quality negative examples, while the discriminator uses distance-based, margin-loss KG embedding models to generate the final KG embeddings. More specifically, it expects the generator to generate negative triples (h′, r, t′) that obey the probability distribution of pG(h′, r, t′|h, r, t), and
Table 7
Characteristic of several GAN-based negative sampling technologies for KGC.
Models KBGAN [21] IGAN [31] KSGAN [29]
Modules Generator, discriminator Generator, discriminator Generator, discriminator, knowledge selector
Generator Semantic matching models Neural network Translational distance models
with softmax probabilistic models with softmax probabilistic models
Discriminator Translational distance models KGE models Semantic matching models
J(θ ) = Ee∼p(e|·;θ ) [R]
∑ ∑ ∑
Generator reward RG = E(h′ ,r ,t ′ )∼pG [R] RG = E(h′ ,r ,t ′ )∼pG [R]
function (h,r ,t)∈T (h,r ,t)∈T (h′ ,r ,t ′ )∈Ts′

Discriminator reward R = −fD (h′ , r , t ′ ) − b(h, r , t) R = tanh(fr (h, t) − fr (h′ , t ′ ) + γ ) R = fD (h′ , r , t ′ )


function
exp s (h ,r ,t )
′ ′ exp f (h ,r ,t ) ′ ′
Probability distribution pG (h′ , r , t ′ |h, r , t) = ∑ exp Gs (h∗ ,r ,t ∗ ) p(e|(h, r , t), z ; θ ) = z · (e|t , r ; θ )+ pG (h′ , r , t ′ |h, r , t) = ∑ exp Gf (h∗ ,r ,t ∗ )
G G
of sampling (1 − z)· (e|h, r −1 ; θ )
Selector – – fsel (h′ , r , t ′ ) = max (fD (h′ , r , t ′ ))
(h′ ,r ,t ′ )∈Ts′

assumes that the score function of the discriminator is sD(h, r, t); then the objective of the discriminator is to minimize the margin loss function as follows:

LD = Σ_{(h,r,t)∈T} [sD(h, r, t) − sD(h′, r, t′) + γ]+ ,  (h′, r, t′) ∼ pG(h′, r, t′|h, r, t)

while the objective function of the generator is defined as a negative distance expectation:

RG = Σ_{(h,r,t)∈T} E[−sD(h′, r, t′)],  (h′, r, t′) ∼ pG(h′, r, t′|h, r, t)

Note that those negative samples are created by the generator and its probability distribution pG is modeled with:

pG(h′, r, t′|h, r, t) = exp sG(h′, r, t′) / Σ exp sG(h∗, r, t∗),  (h∗, r, t∗) ∈ Neg(h, r, t)

where sG(h′, r, t′) means the generator's score function and the candidate negative triples set is:

Neg(h, r, t) ⊂ {(h′, r, t)|h′ ∈ E} ∪ {(h, r, t′)|t′ ∈ E}

To enable backpropagation of errors in the generator, KBGAN relies on policy gradient, a variance-reduction REINFORCE method with a one-step reinforcement learning setting, to seamlessly integrate the generator module and the discriminator module. On the one hand, KBGAN enhances KG embedding by considering adversarial learning; on the other hand, this framework is independent of specific embedding models so that it can be applied to a wide range of KG embedding models without the need for external constraints.

GAN-based framework (IGAN) [31] is also an answer to negative sampling in the KGC procedure, which can obtain quality negative samples to provide a non-zero loss situation for the discriminator; thus it makes full use of the discriminator operating with a margin-based ranking loss. Different from [28], IGAN obeys a probability distribution over the entity set E as:

p(e|(h, r, t), z; θ) = z · p(e|t, r; θ) + (1 − z) · p(e|h, r^{−1}; θ)

where the binary flag z ∈ {1, 0} reflects whether to replace the head entity or the tail entity. By the way, the GAN-based model is also flexible, with good adaptive capacity to be extended to various KG embedding models. The general process of IGAN is shown in Fig. 11(b).

Fig. 11. Several GAN-based KGC methods about negative sampling.
Source: Figures are adapted from [28,29,31].

KSGAN [29] further advances KBGAN by adopting a selective adversarial network to generate better negative examples for training, as shown in Fig. 11(c). The proposed new knowledge selective adversarial network adds a new knowledge selector module to the previous adversarial network structure to enhance the performance of the discriminator and generator in KBGAN; it purposely picks out corrupted triples of high quality from the generator which have a high discriminator score:

ssel(h′, r, t′) = max_{(h′,r,t′)∈Ts′} (sD(h′, r, t′))

where the picked corrupted triples compose a selection set Ts′; thus the selector selects negative triples with correct semantic information or close distance, which can help the discriminator to avoid zero loss during the training process.
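To illustrate the common core of these adversarial negative-sampling schemes, the sketch below samples a corrupted triple from a softmax over generator scores and forms a REINFORCE-style surrogate term using the discriminator score as the (negative) reward. It is a toy illustration under our own naming, not the released code of KBGAN, IGAN, or KSGAN.

import numpy as np

rng = np.random.default_rng(5)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy scores: s_G over candidate corrupted triples, s_D from the discriminator.
generator_scores = rng.normal(size=6)        # s_G(h', r, t') for 6 candidates
discriminator_scores = rng.normal(size=6)    # s_D(h', r, t') for the same set

# Generator samples a negative triple according to p_G ∝ exp(s_G)
p_g = softmax(generator_scores)
idx = rng.choice(len(p_g), p=p_g)

# REINFORCE-style signal: reward = -s_D of the sampled negative; the gradient
# of log p_G(idx) would be scaled by this reward during training.
reward = -discriminator_scores[idx]
log_prob = np.log(p_g[idx])
print(idx, reward, reward * log_prob)        # the policy-gradient surrogate term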
Table 8
Published results of Neural Network-based KGC methods. Best results are in bold.
Model WN18RR FB15K-237
MR MRR Hits@1 Hits@3 Hits@10 MR MRR Hits@1 Hits@3 Hits@10
ConvE [33]b 4464 0.456 0.419 0.470 0.531 245 0.312 0.225 0.341 0.497
ConvKB [68]a 3433 0.249 – – 0.524 309 0.243 – – 0.421
CapsE [69]a 718 0.415 – – 0.559 403 0.150 – – 0.356
InteractE [67] 5202 0.463 0.430 – 0.528 172 0.354 0.263 – 0.535
R-GCN [70]b 6700 0.123 0.080 0.137 0.207 600 0.164 0.100 0.181 0.300
SACN [71] – 0.470 0.430 0.480 0.540 – 0.350 0.260 0.390 0.540
Conv-TransE [71] – 0.460 0.430 0.470 0.520 – 0.330 0.240 0.370 0.510
SACN with FB15k-237-Attr [71] – – – – – – 0.360 0.270 0.400 0.550
COMPGCN [72] 3533 0.479 0.443 0.494 0.546 197 0.355 0.264 0.390 0.535
ParamE-MLP [66] – 0.407 0.384 0.429 0.445 – 0.314 0.240 0.339 0.459
ParamE-CNN [66] – 0.461 0.434 0.472 0.513 – 0.393 0.304 0.426 0.576
ParamE-Gate [66] – 0.489 0.462 0.506 0.538 – 0.399 0.310 0.438 0.573
KBGAN [28] – 0.215 – – 0.469 – 0.277 – – 0.458
KSGAN [29] – 0.220 – – 0.479 – 0.280 – – 0.465
a Resulting numbers are re-evaluated by [90].
b Resulting numbers are reported by [91], and others are taken from the original papers.

3.1.2.5. Performance analysis about neural network-based KGC models. We report the published results of Neural Network-based KGC approaches in Table 8 and make a simple comparison between them. From Table 8 we have the following findings:

1. Among the first four CNN-based KGC models, CapsE performs well on WN18RR because (1) in CapsE, the length and orientation of each capsule in the first layer can help to model the important entries in the corresponding dimension, so that CapsE is good at handling much sparser datasets, like WN18RR; (2) CapsE uses pre-trained GloVe [92] word embeddings for initialization and thus uses additional information.

2. R-GCN, SACN and its variants, and COMPGCN are all extensions of GCNs; both SACN and COMPGCN make use of a weighted GCN to aggregate neighbor information through learnable weights, and therefore they all deliver relatively consistent, excellent results on all datasets. Besides, ‘‘SACN with FB15k-237-Attr’’ uses additional attribute information in the FB15k-237 dataset, which further results in higher results on FB15k-237.

3. We observe that ‘‘ParamE-Gate’’ basically outperforms all the other neural network models, which is obviously reflected in the MRR, Hits@1, and Hits@3 metrics on both datasets. Note that ConvE and ParamE-CNN have similar network architectures, but ParamE-CNN achieves a substantial improvement over ConvE. ParamE-CNN takes its own parameters as relation embeddings, which can capture the intrinsic property and is more reasonable [66]. The performance comparison among ‘‘ParamE-MLP’’, ‘‘ParamE-CNN’’ and ‘‘ParamE-Gate’’ shows that MLP has a weaker modeling ability than convolution layers and the gate structure. Moreover, although convolution layers are good at extracting features, ‘‘ParamE-CNN’’ performs worse than ‘‘ParamE-Gate’’ because the gate structure can optionally let some useful information through. In addition, although the differences between the FB15k-237 dataset and the WN18RR dataset lead some models to unbalanced performance across the two datasets, ParamE-Gate can work well on both datasets.

3.1.2.6. Discussion on Neural Network Models. Also known as non-linear models, the neural network KGC models rely on neural network structures (along with non-linear activation functions, such as the sigmoid function, tanh function, Rectified Linear Unit (ReLU) function, etc.; this can be seen from Table 6) to learn deep potential features.

Much of the literature on KGE uses neural networks to represent KGs in low-dimensional continuous space [11,14,15,64]. They can effectively extract the hidden latent features needed for knowledge reasoning with strong accuracy, high reasoning scalability, and efficiency. However, neural network KGC models rely on a large number of training data, which makes them a kind of data-driven work; therefore they usually do not perform well when dealing with sparse KG data because of their great dependence on data. Moreover, these kinds of models have some other shortcomings, such as low interpretability, too many parameters, and poor performance in handling sparse KGs.

With the diversified research on KGC methods, more additional information is used in the completion work. It should be noted that several models we previously discussed make use of some additional information besides structural information. For example, the typical neural network KGC model SACN [71] applies a weighted graph convolutional network (WGCN) as its encoder, which utilizes node attributes and relation type information.

The widely known CNN-based KGC models have effective performance that benefits from the strong expressiveness of neural networks. Typically, ConvE and ConvKB tend to be applied as the decoder model in lots of KGC methods (such as [72,91]) to conduct KGC. Likewise, various other neural network families have been widely applied, working with different additional information for conducting KGC. Take the recurrent neural network (RNN) as an example: because of its superior ability to learn sequence features, the RNN is often used in relational path-based KGC methods and is also exploited to deal with long text information (e.g., entity description text) for KGC. Similarly, a CNN can be regarded as a feature extractor for textual feature modeling in the KGC procedure, substituting the RNN structure (e.g., [93–95]). Zia et al. [96] is also an example that involves a GAN structure combined with path information, which will be introduced in detail in the subsequent additional-information-based KGC methods.

3.2. Translation models

As a family of methods concentrating on distributed representation learning for KGs, translation models are both straightforward and show satisfying performance on KGC; they are promising for encoding entities as low-dimensional embeddings and relations between entities as translation vectors. This kind of model usually defines a relation-dependent translation scoring function to measure the plausibility of a triple through a distance metric. In the ordinary sense, the distance score reflects the correctness of a triple (h, r, t), and more generally, it collocates with a margin-based ranking loss for learning the translation relation between entities. We also list a brief table about the basic characteristics of the introduced translation models in Table 9.
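Before walking through the individual models, here is a minimal sketch of the generic translation recipe: a distance-based score s(h, r, t) = ∥vh + vr − vt∥ combined with a margin-based ranking loss over a positive triple and a corrupted one. This is our own illustration with assumed variable names, not the code of any specific model below.

import numpy as np

def translation_score(v_h, v_r, v_t, norm=2):
    # s(h, r, t) = || v_h + v_r - v_t ||   (lower means more plausible)
    return np.linalg.norm(v_h + v_r - v_t, ord=norm)

def margin_ranking_loss(pos, neg, gamma=1.0):
    # L = [gamma + s(positive) - s(corrupted)]_+
    return max(0.0, gamma + pos - neg)

rng = np.random.default_rng(6)
h, r, t = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
t_corrupt = rng.normal(size=8)               # corrupted tail entity
loss = margin_ranking_loss(translation_score(h, r, t),
                           translation_score(h, r, t_corrupt))
print(loss)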
Table 9
Summarization and comparison about Translation models for KGC.
Model Highlights Score Function Notion Difination Loss Objectivea Datasetsb
TransE Extensions:
TransE Precursory s(h, r , t) = ∥vh + vr − vt ∥ vh , vr , vr ∈ Rd Lmarg LP: WN, FB15K,
[11] translation method FB1M
TransH Performs s(h, r , t) = ∥vh⊥ + vr − vt⊥ ∥, vh , vr , vr ∈ Rd ; Lmarg LP: WN18, FB15k;
[15] translation in vh⊥ = vh − wrT vh wr , wr ∈ Rd TC: WN11, FB13,
relation-specific vt⊥ = vt − w vt wr T
r FB15K
hyperplane
TransR Converts entity s(h, r , t) = ∥Mr vh + vr − Mr vt ∥ vh , vt ∈ Rd , vr ∈ Lmarg LP: WN18, FB15k;
[12] space to relation Rk ; TC: WN11, FB13,
space Mr ∈ Rk×d FB15K
relational space
projection
TransD Different relational s(h, r , t) = ∥Mrh vh + vr − Mrt vt ∥ vh , vt , vhp , vtp ∈ Lmarg LP: WN18, FB15k;
[97] mapping Mrh = vrp vhTp + I k×d , Rd ; TC: WN11, FB13,
matrix to head Mrh = vrp vhTp + I k×d vr , vrp ∈ Rk ; FB15k
and tail entity; Mrh , Mrt ∈ Rk×d
vector
multiplication
lppTransD Role-specific s(h, r , t) = ∥Mrh

vh + vr − Mrt′ vt ∥ vh , vt , vhp , vtp ∈ Lmarg LP: WN18, FB15K;
[98] projection ′
Mrh = vrph vhTp + I k×d , Rd ; TC: WN11, FB13,
Mrt′ = vrpt vtTp + I k×d vr , vrph , vrpt ∈ Rk ; FB15K
Mrh , Mrt ∈ Rk×d
TransF Light weight and s(h, r , t) = ∥Mrh vh + vr − Mrt vt ∥, vh , vr ∈ Rd , vr ∈ Lmarg LP: FB15k, WN18;
[99] robust; f
(i) (i) Rk ; TC: FB15k-237,
α

explicitly model Mrh = r U + I, U (i) , V (i) ∈ Rk×d WN18RR
i=1
basis subspaces f Mrh , Mrh ∈ Rk×d
βr(i) V (i) + I

of projection Mrt =
matrices i=1

STransE SE+TransE s(h, r , t) = ∥Wr ,1 vh + vr − Wr ,2 vt ∥ vh , vr , vr ∈ Rd , Lmarg LP: WN18, FB15k


[100] Wr ,1 , Wr ,2 ∈ Rd×d
Trans-FT Flexible translation s(h, r , t) = vh , vt , vhr , vtr ∈ Lmarg LP: WN18,FB15k;
[101] modeling (vhr + vr )T vtr + vhTr (vtr − vr ), Rd ; TC: WN11, FB13,
vhr = Mr vh , vtr = Mr vt vr , vrp ∈ Rk , FB15K
Mr ∈ Rk×d
Translation Models Using Attention Mechanism:
TransM Relational s(h, r , t) = wr ∥vh + vr − vt ∥ vh , vr , vt ∈ Rd , Lmarg LP: WN18, FB15K
[102] mapping; wr = 1
log(hr ptr +tr phr )
wr ∈ R
property-specific hr ptr : heads per
weight tail,
tr phr : tails per
head
ITransF Sparse attention s(h, r , t) = ∥vha tt + vr − vta tt ∥ vh , vr , vt ∈ Rd ; Lmarg WN18 and FB15k
[103] mechanism; vha tt = αrH · D · vh αrX ∈ [0, 1]m ,
relation concepts vta tt = αrT · D · vt IrX ∈ {0, 1}m ,
sharing αrX = SparseSoftmax(vrX , IrX ), vrX ∈ Rm , X =
X = H, T H, T ;
D ∈ Rm×d×d
TransAt Relation-related s(h, r , t) = Pr (h) + vr − Pr (t) vh , vr , vt ∈ Rd ; Lmarg LP: WN18, FB15k;
[104] entities categories; Pr (h)c = Pr (σ (rh )vh ) ar ∈ {0, 1}d TC: WN11, FB13
relation-related Pr (t)c = Pr (σ (rt )vt )
attention Pr (x) = ar ∗ vx , x = h, t
TransGate Gate structure; s(h, r , t) = ∥vhr + vr − vtr ∥ vh , vr , vt ∈ Rd Lmarg LP: WN18RR,
[105] shared xr = x ⊙ σ (z), FB15K,
discriminate z(x) = Wx ⊙ x + Wrx ⊙ r + bx FB15K-237;
mechanism x = h, t TC: WN11, FB13

(continued on next page)

3.2.1. TransE extensions

We introduce several prominent translation KGC models in the TransE [11] family, which are frequently summarized and cited in the literature. We draw a comprehensive figure exhibiting some representative translation models (shown as Fig. 12).

TransE [11], as a pioneer translation KGC model, can balance both effectiveness and efficiency compared to most traditional methods. TransE projects entities and relations together into a continuous low-dimensional vector space, where the tail-entity t in a triple (h, r, t) can be viewed as the result of the translation operator applied to the head entity h and relation r, that is:

vh + vr ≈ vt

and it defines its score function as:

s(h, r, t) = ∥vh + vr − vt∥_{l1/2}

However, the over-simplified translation assumption TransE holds might constrain the performance when modeling complicated relations, which leads to the weakness that TransE can only model pure 1−1 relations in KGs. To effectively learn
Table 9 (continued).
Model Highlights Score Function Notion Difination Loss Objectivea Datasetsb
Modification to Loss Objection of Translation-based KGC:
TransRS Upper limit score s(h, r , t) = ∥vh + vr − vt ∥ vh , vr , vt ∈ Rd Lmarg LP: WN18, FB15k;
[106] function fr (h, t) ≤ γ ′ +Llimit TC: WN11, FB13,
for positive FB15K
triplets;
limit-based
scoring loss
TransESM Trans-RS+TransE’s s(h, r , t) = ∥vh + vr − vt ∥ vh , vr , vt ∈ Rd ; soft Lmarg A scholarly KG
[107] score function fr (h, t) ≤ γ1 , γ2 ≥ γ1 ≥ 0;
Soft Margin loss fr (h′ , t ′ ) ≥ γ2 − ξhr,t , (h′ , r ′ , t ′ ) ∈ T ′ ,
ξhr,t ≥ 0 (h, r , t) ∈ T
Transition Models in Novel Vector Space:
TransA Adaptive metric s(h, r , t) = vh , vr , vt ∈ Rd Lmarg LP: WN18, FB15K;
[108] approach; (|vh + vr − vt |)T Wr (|vh + vr − vt |) TC: WN11, FB13
.
elliptical surfaces |x| = (|x1 |, |x2 |, . . . , |xn |),
modeling xi = vhi + vri − vti
TorusE TransE+Torus s(h, r , t) = [h], [r ], [t ] ∈ T n Lmarg LP: WN18, FB15K
[109] min(x,y)∈([h]+[r ])×[t ] ∥x − y∥ T is a torus space
RotatE Entire complex s(h, r , t) = ∥vh ◦ vr − vt ∥ vh , vr , vt ∈ Cd ; Lns LP: FB15k, WN18,
[110] space C; vri = C, |vri | = 1 FB15k-237,
self-adversarial WN18RR
negative sampling
a
Put simply, the Lns and Lmarg are negative sampling loss and margin-based ranking loss respectively, also, LCmarg means a Confidence-aware margin-based ranking loss
[111], and Llimit refers to the Limit-based Scoring Loss in [106], while the LHRS is the HRS-aware loss function in [112].
b
When we describe the datasets, we apply the shorthand for: ‘LP’ means Link Prediction task, while ‘TC’ means Triple Classification task.
*** The vrc , vr′ and vrs are respective the relation cluster embedding, relation-specific embedding and sub-relation embedding in [112].
c
Pr () is a projection function.

Fig. 12. TransE and its extension models. These pictures are referred to [12,15,97–99,102,110].

complex relation types and model various KG structures, a series separate sets of basis matrices U (i) , V (i) , and the two factorized
of enhanced translation-based KGC models continuously improve projection matrices are calculated as:
the TransE. s

TransH [15] projects the entities onto the relation-specific hyperplane wr (the normal vector) by h⊥ = vh − wr^T vh wr or t⊥ = vt − wr^T vt wr and then performs the translation on this hyperplane, so that the score function is defined as follows:

s(h, r, t) = ∥h⊥ + vr − t⊥∥_2^2

which can model 1−n, n−1 and even n−n relations effectively.

TransR [12] considers that there are semantic differences between entities and relations, so that they should live in different semantic spaces. Moreover, different relations should constitute different semantic spaces. It converts the entity space to the corresponding relation space through a relational projection matrix Mr ∈ R^{d×k}, and the translation performed in the relation space is:

vh Mr + vr ≈ vt Mr

In order to better model the complicated internal correlations within diverse relation types, this work also extends TransR by incorporating the idea of piecewise linear regression to form Cluster-based TransR (CTransR); they introduce a cluster-specific relation vector rc and matrix Mr for each cluster of entity pairs. However, although TransR performs well in handling complicated relation patterns, it involves too many additional parameters, resulting in poor robustness and scalability issues when learning large KGs.

TransD [97] further advances TransR by assigning different relational mapping matrices Mrh, Mrt ∈ R^{m×n} to the head and tail entities respectively:

Mrh = rp hp^T + I^{m×n}
Mrt = rp tp^T + I^{m×n}
h⊥ = Mrh h, t⊥ = Mrt t

The subscript p marks the projection vectors. It then scores a triple (h, r, t) by defining the following function:

s(h, r, t) = −∥h⊥ + r − t⊥∥_2^2

Thus each object in the KG is equipped with two vectors. Additionally, TransD replaces matrix multiplication with vector multiplication, which significantly increases the speed of operation.

lppTransD [98] is an extension of TransD, which accounts for the different roles of head and tail entities. They indicated that logical properties of relations, such as transitivity and symmetry, cannot be represented by using the same projection matrix for both the head and tail entities [99]. To preserve these logical properties, the lpp-series ideas consider a role-specific projection that maps an entity to a distinct vector according to its role in a triple, whether it is a head entity or a tail entity. The concrete mapping matrices are designed as:

Mrh = rph hp^T + I^{m×n}
Mrt′ = rpt tp^T + I^{m×n}

TransF [99] is similar to lppTransD, applying the same idea of computing the projection matrices for head and tail entities separately. The difference between lppTransD and TransF is that TransF mitigates the burden of relation projection by explicitly modeling the basis subspaces of the projection matrices with two separate sets of basis matrices U^(i), V^(i); the two factorized projection matrices are calculated as:

M_{r,h} = ∑_{i=1}^{s} α_r^(i) U^(i) + I
M_{r,t} = ∑_{i=1}^{s} β_r^(i) V^(i) + I

Inspired by TransR, TransF is robust and lightweight enough to deal with large-scale KGs, easily learning multiple relations by explicitly modeling the underlying subspace of the relation-specific projection matrix.
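As a rough illustration of how these projection-based extensions differ, the sketch below contrasts the TransH hyperplane projection, the TransR projection matrix, and the TransD dynamic mapping before applying the same translation-distance score. The toy dimensions and random parameters are assumptions for illustration, not the authors' implementations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3                      # entity and relation space dimensions (toy sizes)

def dist(x, y):
    return np.linalg.norm(x - y)

# TransH: project entities onto the relation-specific hyperplane with normal w_r.
def transh_score(h, r, t, w_r):
    w_r = w_r / np.linalg.norm(w_r)
    h_p = h - (w_r @ h) * w_r
    t_p = t - (w_r @ t) * w_r
    return dist(h_p + r, t_p)

# TransR: map entities into the relation space with a projection matrix M_r (k x d).
def transr_score(h, r, t, M_r):
    return dist(M_r @ h + r, M_r @ t)

# TransD: build dynamic mapping matrices from projection vectors,
# M_rh = r_p h_p^T + I, M_rt = r_p t_p^T + I (I is the k x d identity-like matrix).
def transd_score(h, r, t, h_p, t_p, r_p):
    I = np.eye(k, d)
    M_rh = np.outer(r_p, h_p) + I
    M_rt = np.outer(r_p, t_p) + I
    return dist(M_rh @ h + r, M_rt @ t)

h, t = rng.normal(size=d), rng.normal(size=d)
r, r_p = rng.normal(size=k), rng.normal(size=k)
print(transh_score(h, rng.normal(size=d), t, rng.normal(size=d)))
print(transr_score(h, r, t, rng.normal(size=(k, d))))
print(transd_score(h, r, t, rng.normal(size=d), rng.normal(size=d), r_p))
```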
STransE [100] properly combines insights from SE [113] and TransE [11]: it draws on the relation-specific matrices of SE for relation-dependent identification of both the head entity and the tail entity, and also follows the basic translation principle of the TransE model.

Trans-FT [101] develops a general principle called Flexible Translation (FT), which enables it to model complex and diverse objects in KGs, unlike previous translation models that only concentrate on a strict restriction of translation among entities/relations (such as TransE). Experiments adapt FT to existing translation models, and TransR-FT obtains the best performance compared to the other two baselines (TransE-FT and TransH-FT).

3.2.2. Translation models with attention mechanism

TransM [102] is an appropriate solution to the inflexibility issue of TransE. It focuses more on the diverse contribution (i.e., the various relational mapping properties) of each training triple to the final optimization target; therefore TransM develops a weighting mechanism with which each training triple is assigned a pre-calculated distinct weight according to its relational mapping property. In other words, we can regard this weighting operation as an attention mechanism that treats every training example as carrying its own impact, so as to deal well with the various mapping properties of triples.

ITransF [103] To make full use of the shared concepts of relations and apply them to perform knowledge transfer effectively, ITransF is equipped with a sparse attention mechanism to discover sharing regularities and learn interpretable sparse attention vectors, which fully capture the hidden associations between relations and shared concepts.

TransAt [104] effectively learns translation-based embeddings using a reasonable attention mechanism; it exploits a piecewise evaluation function which divides the KGC problem into a two-stage process: first checking whether the categories of the head and tail entities with respect to a given relation make sense, and then, for those possible compositions, considering whether the relation holds under the relation-related dimensions (attributes). During this two-stage process, TransAt uses K-means clustering to generate categories for generality. TransAt sets the projection function by computing, for each dimension, the variances between the head (tail) entities associated with relation r in the training set; additionally, it designs a threshold to determine whether a dimension should be retained. In consideration of the ORC structure problem [114], TransAt utilizes an asymmetric operation on both the head entity and the tail entity, so that the same entity will have different representations in the head position and the tail position.

TransGate [105] pays close attention to the inherent relevance between relations. To learn more expressive features and reduce parameters simultaneously, TransGate follows the idea of parameter sharing using a gate structure.

It then integrates the shared discriminate mechanism into its architecture to ensure that the space complexity is the same as that of indiscriminate models. The shared gates mentioned above are also reconstructed with weight vectors to avoid matrix–vector multiplication operations, making the model more efficient and scalable.

3.2.3. Modification to the loss objective of translation-based KGC

Some translation models try to improve KGC by modifying the objective function [106,107]. To facilitate the comparison among these improved loss schemes, see Table 9, from which they can easily be picked out by their distinctive loss objectives.
TransRS [106] explores a limit-based scoring loss LS that imposes an upper limit on the score of a positive triple, and then adds this limit-based scoring loss term to the original loss function as a new objective for optimization. In this way, the modified loss objective includes two terms, a limit-based scoring loss as well as the original margin-based ranking loss LR, that is:

LRS = LR + λLS, (λ > 0)

When this loss is applied to traditional translation baselines such as TransE and TransH (i.e., TransE-RS and TransH-RS), it achieves remarkable performance improvements compared with the initial models.
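A minimal sketch of how such a combined objective can be computed for one positive/negative pair is given below. The hyper-parameter values (gamma, gamma_limit, lam) and the toy scores are illustrative assumptions, not those of the original papers.

```python
def margin_ranking_loss(pos_score, neg_score, gamma=1.0):
    """Classic margin-based ranking loss L_R: push negative scores at least
    gamma above positive scores (scores are distances, lower = better)."""
    return max(0.0, gamma + pos_score - neg_score)

def limit_loss(pos_score, gamma_limit=0.5):
    """Limit-based scoring loss L_S: additionally require every positive
    triple's score to stay below an upper limit gamma_limit."""
    return max(0.0, pos_score - gamma_limit)

def transrs_loss(pos_score, neg_score, gamma=1.0, gamma_limit=0.5, lam=0.1):
    """TransE-RS-style objective: L_RS = L_R + lambda * L_S (lambda > 0)."""
    return margin_ranking_loss(pos_score, neg_score, gamma) + lam * limit_loss(pos_score, gamma_limit)

# Toy distances produced by some translation score function.
print(transrs_loss(pos_score=0.8, neg_score=1.2))
```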
TransESM [107] not only changes the score function and loss function of Trans-RS into TransE's score function with soft margins (a margin ranking loss in which soft margins allow false-negative samples to slide slightly into the margin, mitigating their adverse effects), but also indicates that most existing methods are tested only on datasets derived from Freebase and WordNet, which may hinder the development of KGC technology. Therefore, they evaluate TransESM and compare TransE with other models on domain-specific datasets (a faculty KG and an academic KG), and find that TransE outperforms ComplEx [43], TransH [15] and TransR [12] on these specific-field datasets.
3.2.4. Transition models in novel vector space

Most of the translation distance models tend to leverage spherical equipotential hyper-surfaces with different plausibility. Unfortunately, the over-simplified loss metric they use limits their ability to model complex relational data in KGs. As shown in Fig. 13, on the equipotential hyper-surfaces, the nearer a triple is to the center, the more plausible it is, so it is difficult to correctly identify the matched answer entities from unmatched ones. As is common in KGs, complex relations (including 1-to-n, n-to-1, and n-to-n relations) always require complex embedding topology techniques. Although complex embedding is an urgent challenge, the existing translation methods are not satisfactory for this task because of the inflexibility of spherical equipotential hyper-surfaces.

Fig. 13. Visualization of TransE embedding vectors for Freebase with PCA dimension reduction. The navy crosses are the matched tail entities for an actor's award nominee, while the red circles are the unmatched ones. TransE applies the Euclidean metric and spherical equipotential surfaces, making seven mistakes as (a) shows, while TransA takes advantage of an adaptive Mahalanobis metric and elliptical equipotential surfaces, avoiding four of those mistakes in (b) [108].
TransA [108] Going beyond modeling on a traditional spherical surface, TransA applies an adaptive and flexible metric on an elliptical surface for KG embedding. TransA not only represents the complex embedding topologies induced by complex relations well, but can also suppress the noise from unrelated dimensions, since TransA itself can be treated as weighting the transformed feature dimensions in its Adaptive Metric Approach.
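The adaptive metric can be sketched as follows; the relation-specific weight matrix W_r below is a toy symmetric matrix chosen only for illustration, not a trained parameter from the paper.

```python
import numpy as np

def transa_score(v_h, v_r, v_t, W_r):
    """TransA score: (|v_h + v_r - v_t|)^T W_r (|v_h + v_r - v_t|),
    i.e., a Mahalanobis-like adaptive metric with a relation-specific
    weight matrix W_r instead of the plain Euclidean distance."""
    x = np.abs(v_h + v_r - v_t)
    return x @ W_r @ x

rng = np.random.default_rng(1)
d = 4
v_h, v_r, v_t = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
A = rng.random((d, d))
W_r = (A + A.T) / 2          # toy symmetric non-negative weight matrix
print(transa_score(v_h, v_r, v_t, W_r))
```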
TorusE [109] transforms the real vector space into a torus (a compact Abelian Lie group, depicted in Fig. 14), while keeping the same translation principle as TransE. TorusE is proposed to overcome TransE's regularization flaw, namely that regularization conflicts with the translated-embedding principle and meanwhile reduces accuracy in the LP task. A Lie group is a group that is also a finite-dimensional smooth manifold, in which the group operations of multiplication and inversion are smooth maps, while the Abelian Lie group is the special case of a Lie group whose multiplication operation is commutative; it satisfies all the conditions that an embedding space should meet under TransE's embedding strategy. TorusE defines three types of scoring functions fL1, fL2 and feL2 exploiting the corresponding distance functions. TorusE performs well in the LP task and, in addition, has other excellent characteristics: it not only has good computational performance but also possesses high scalability.

Fig. 14. Visualization of embeddings on the 2-dimensional torus obtained by TorusE. Embeddings of the triples (A, r, A′) and (B, r, B′) are illustrated. Note that [A′] − [A] and [B′] − [B] are similar on the torus [109].
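The translation principle on a torus can be illustrated with fractional-part arithmetic: each coordinate lives on a circle of circumference 1, and distance is measured the short way around. The sketch below is only a simplified illustration of this idea (a per-dimension L1-style torus distance with toy values), not the authors' implementation.

```python
import numpy as np

def torus_l1_distance(x, y):
    """L1-style distance on the n-dimensional torus [0, 1)^n:
    per coordinate, take the shorter way around the circle."""
    diff = np.abs((x - y) % 1.0)
    return np.sum(np.minimum(diff, 1.0 - diff))

def toruse_score(h, r, t):
    """TorusE-style score: translate [h] by [r] on the torus and
    measure its distance to [t]."""
    return torus_l1_distance((h + r) % 1.0, t % 1.0)

h = np.array([0.10, 0.95, 0.40])
r = np.array([0.20, 0.10, 0.50])
t = np.array([0.30, 0.05, 0.90])
print(toruse_score(h, r, t))   # 0.0: the translated head wraps around onto the tail
```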
RotatE [110] Inspired by Euler's identity, RotatE is defined on the entire complex space, which has much more representation capacity than the above Lie group-based TorusE model; the latter sets the moduli of its embeddings to be fixed and can therefore be regarded as a special case of RotatE. RotatE can uniformly model and infer three relation patterns, symmetry/antisymmetry, inversion, and composition, by defining each relation as a rotation in the entire complex space. Moreover, RotatE develops a novel self-adversarial negative sampling technique to train the model effectively.
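A compact sketch of the rotation-based score follows: complex embeddings with unit-modulus relation entries, where the tail in the toy example is constructed by rotating the head so the matching triple scores near zero. The values and dimensionality are assumptions for illustration only.

```python
import numpy as np

def rotate_score(h, r_phase, t):
    """RotatE score: each relation is a rotation in complex space,
    r_i = exp(i * theta_i) with |r_i| = 1, and the score is
    || h o r - t || (o = element-wise complex multiplication)."""
    r = np.exp(1j * r_phase)                 # unit-modulus relation embedding
    return np.linalg.norm(h * r - t)

rng = np.random.default_rng(2)
d = 4
h = rng.normal(size=d) + 1j * rng.normal(size=d)
r_phase = rng.uniform(0, 2 * np.pi, size=d)  # rotation angles
t = h * np.exp(1j * r_phase)                 # construct a "true" tail by rotating h
print(rotate_score(h, r_phase, t))           # ~0 for the matching triple
print(rotate_score(h, r_phase, h))           # larger for a mismatched tail
```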
3.2.5. Performance analysis about translation models

We make a simple comparison of these translation models for KGC and report their results in Table 10, from which we have the following findings:
Table 10
Published link prediction results of translation models. Best results are in bold.
Model WN18 FB15K WN18RR FB15K-237
MR MRR Hits@10 MR MRR Hits@10 MR MRR Hits@10 MR MRR Hits@10
TransE [11] 251 – 0.892 125 – 0.471 2300a 0.243a 0.532a 323a 0.279a 0.441a
TransH [15] 303 – 0.867 87 – 0.644 – – – – – –
TransR [12] 225 – 0.920 77 – 0.687 – – – – – –
TransD [97] 212 – 0.922 91 – 0.773 – – – – – –
lppTransD [98] 270 – 0.943 78 – 0.787 – – – – – –
TransF [99] 198 0.856 0.953 62 0.564 0.823 3246 0.505 0.498 210 0.286 0.472
STransE [100] 206 0.657 0.934 69 0.543 0.797 – – – – – –
Trans-FT [101] 342 – 0.953 49 – 0.735 – – – – – –
TransM [102] 281 – 0.854 94 – 0.551 – – – – – –
ITransF [103] 223 – 0.952 77 – 0.814 – – – – – –
TransAt [104] 157 – 0.950 82 – 0.782 – – – – – –
TransRS [106] 357 – 0.945 77 – 0.750 – – – – – –
TransA [108] 392 – 0.943 74 – 0.804 – – – – – –
TorusE [109] – 0.947 0.954 – 0.733 0.832 – – – – – –
RotatE [110] 309 0.949 0.959 40 0.797 0.884 3340 0.476 0.571 177 0.388 0.533
TransGate [105] – – – 33 0.832 0.914 3420 0.409 0.510 177 0.404 0.581
a. Resulting numbers are reported by [91]; the others are taken from the original papers.

1. From the experimental results of TransE and its extension models TransH, TransR, TransD, lppTransD, TransF, STransE, and Trans-FT, we can conclude that: (1) Based on the translation idea of TransE, for a triple (h, r, t), it is necessary to further consider the semantic differences between entities and relations. (2) TransF achieves a clear and substantial improvement over the others in this series. The reason is that TransF factorizes the relation space as a combination of multiple sub-spaces for representing different types of relations in KGs. Besides, TransF is more robust and efficient than congeneric methods because it models the underlying subspace of the relation-specific projection matrix to explicitly learn various relations.
2. The attention-based methods TransM, ITransF, and TransAt almost consistently outperform TransE. Specifically, ITransF performs better on most of the metrics of WN18 and FB15k, while TransM has a poor result on the sparser WN18 dataset. The reason is that ITransF employs a sparse attention mechanism to encourage concepts of relations to be shared across different relations, which primarily benefits facts associated with rare relations. TransAt focuses on the hierarchical structure among the attributes of an entity, so it utilizes a two-stage discriminative method to realize an attention mechanism. This suggests that a proper attention mechanism can help to fit the human cognition of a hierarchical routine effectively.
3. Both TorusE and RotatE obtain good performance on WN18 and FB15k. RotatE is good at modeling and inferring three types of relation patterns, the symmetry pattern, the composition pattern, and the inversion pattern, by defining each relation as a rotation in complex vector space. By comparison, TorusE focuses on the problem of regularization in TransE. Although TorusE can be regarded as a special case of RotatE, since it defines KG embeddings as translations on a compact Lie group, the moduli of embeddings in TorusE are fixed, while RotatE is defined on the entire complex space, which is very critical for modeling and inferring composition patterns. Therefore, RotatE has much more representation capacity than TorusE, which may help explain why RotatE gains better performance than TorusE on WN18 and FB15k.
4. TransGate achieves excellent performance on the four datasets, especially on the metrics of FB15k and FB15k-237. These results show the appropriateness of sharing discriminate parameters and the great ability of the gate structure. Actually, TransGate strikes a better trade-off between complexity and expressivity by following the parameter sharing strategy. With the help of the shared discriminate mechanism based on the gate structure, TransGate can optimize embeddings and reduce parameters simultaneously. However, TransGate has a poorer performance on WN18RR, since WN18RR removes reverse relations and destroys the inherent structure of WordNet, which results in low relevance between relations and further reduces the effect of parameter sharing [105].

3.2.6. Discussion on translation models

In summary, the translation models based on internal structure information are simple but surprisingly effective when solving KGC problems. Additionally, the translation models only need few parameters. At present, translation models usually serve as the basis for extended models that exploit a wider variety of additional information sources, which benefits from the easy-to-use translation transformation hypothesis. Ordinarily, combining translational characteristics with additional information to conduct KGC is an ongoing trend. This group of methods takes account of other useful information instead of only utilizing the inner structure information, building on the classic translation distance baselines or following the basic translation assumption. For instance, OTE [115] advances RotatE in two ways: (1) leveraging orthogonal transforms [116] to extend RotatE from the 2D complex domain to a high-dimensional space for improving modeling ability, and (2) making use of the context information of nodes. PTransE (path-based TransE) [12] and PTransD [117] are both path-augmented translation-based models, while TransN [31] considers the dependencies between triples and incorporates neighbor information dynamically. On the other hand, researchers have begun to explore how to implement the basic translation transformation of entities and relations in a more effective and reasonable modeling space, so as to easily model complex types of entities and relations and various structural information. In this context, the improvement and optimization of the loss function is also a promising research direction.

4. Additional information-based KGC technologies

The research on additional information-based KGC has received increasing attention in recent years. The techniques surveyed in Section 3 perform KGC mainly relying on the structure information of KGs (i.e., the simple triple structure information); of course, several methods mentioned in Section 3 also simultaneously utilize additional information for KGC. For example, KBAT [91] considers the multi-hop neighborhood information of a given entity to capture entity and relation features, and DrWT [45] leverages the additional Wikipedia page documents of entities outside KGs.

In this section, we focus on the additional information-based KGC techniques, and make a comprehensive and fine-grained summarization and comparison.

We focus specifically on the incorporation of two types of additional information, namely internal side information inside KGs and external extra information outside KGs:

• We introduce the usage of internal side information inside KGs in Section 4.1, which consists of five subclasses: node attributes information (in Section 4.1.1), entity-related information (in Section 4.1.2), relation-related information (in Section 4.1.3), neighborhood information (in Section 4.1.4) and path information (in Section 4.1.5).
• The investigations on incorporating external information outside KGs are in Section 4.2, which involves two aspects: rule-based KGC in Section 4.2.1, and third-party data sources-based KGC in Section 4.2.2.
4.1. Internal side information inside KGs

The inherent rich information (i.e., internal information) inside KGs is often used during KG learning; this non-negligible information plays an important role in capturing useful features of knowledge embeddings for KGC and knowledge-aware applications. In general, the common internal side information inside KGs includes node attributes information, entity-related information, relation-related information, neighborhood information, and relational path information.

4.1.1. Node attributes information

Nodes in KGs usually carry rich attribute information, and this information often explains and reflects the characteristics of entities. For example, the gender, age and appearance of a person correspond respectively to a textual attribute, a non-discrete numerical attribute, and an image attribute; these are the mainstream kinds of attribute information, which are usually exploited in cooperation with the structure information of KGs to jointly learn KG embeddings. Although attribute information of entities is important for understanding an entity and may help to alleviate the inherent sparsity and incompleteness problems that are prevalent in KGs [41], relatively little literature is concerned with attribute information when performing the KGC task. We summarize KGC methods using node attributes, paying close attention to the usage of numeric attributes, text attributes, and image attributes. The general characteristics of these methods are compared and listed in Table 11.
4.1.1.1. Numeric attribute information. Numeric attribute information is a kind of available internal information for KG learning. Many popular KGs such as Freebase, YAGO, or DBpedia maintain a list of non-discrete attributes for each entity. Intuitively, these attributes, such as height, price, or population count, are able to richly characterize entities in KGs. Unfortunately, many state-of-the-art KGC models ignore this information due to the challenging nature of dealing with non-discrete data in inherently binary-natured KGs.

KBLRN Garcia et al. [118] firstly integrate latent, relational and numerical features of KGs for KGC with the support of the newly proposed end-to-end model KBLRN.

MTKGNN [119] is a multi-task learning approach built on a deep learning architecture, which not only leverages non-discrete attribute information in KGC but also aims to predict those numerical attributes.

TransEA [120] consists of two component modules, a structure embedding model and an attribute embedding model. TransEA extends TransE [11] with numeric attribute embedding by adding a numerical attribute prediction loss to the original relational loss of TransE.
numerical attributes.
Multi-modal knowledge base embeddings (MKBE) [124] fo-
TransEA [120] consists of two component modules, a structure cuses on the multimodel relational data for link prediction task,
embedding model and an attribute embedding model. TransEA introduced a novel link prediction model named multi-modal
extends TransE [11] with numeric attributes embedding by knowledge base embeddings (MKBE). MKBE consists of an en-
adding a numerical attribute prediction loss to the original re- coder and a decoder, the encoder employs multiple different
lational loss of TransE. neural structures according to the different multimodel evidence

Table 11
Characteristics of KGC methods using nodes’ attributes information.
Model Highlights Nodes information Jointly learning expression Datasets
KGC using numeric atrribute information
logp((h, r , t)|θ1 , . . . , θn )a

KBLRN [118] End-to-end jointly training model; Numerical attributes L=− FB15k-num,
multi-task learning; (h,r ,t)∈T FB15k-237-num
feature types-combining approach
MTKGNN End-to-end multi-task NN Numeric attributes Lattr = Lhead + Ltail YG24K, FB28K
[119] Lhead = MSE(gh (ai ), (ai )∗ )
Ltail = MSE(gt (aj ), (aj )∗ )
TransEA [120] TransE + numerical attributes Numeric attributes L = (1 − α ) · LTransE + α · LA YG58K, FB15K
LTransE : TransE loss; LA : attribute loss
KGC using textual atrribute information
JointAs [15] Jointly neural network model Node’s name, anchors L = LK + LT + LA Freebase
LK : KGC loss;
LT : Text model loss;
LA : Alignment loss
JointTs [121] Replaces anchors in JointAs with text Node’s name L = LK + LT + LAT Freebase
description LAT : text description-aware Alignment loss
KGC using image atrribute information
IKRL [122] Neural image encoder; Image attributes s(h, r , t) = sSS + sSI + sIS + sII WN9-IMG
translation-based decoder; sXY = ∥hX + r − tY ∥,
attention mechanism S(I): structure(image)-based representations
KGC using multi-model atrribute information
VALR [123] Linguistic embeddings; Text attributes, image s(h, r , t) = sS + sM1 + sM2 + sSM + sMS FB-IMG, WN9-IMG
neural network architecture; attributes sS = ∥hS + rS − tS ∥,
multi-model additional energy function sM1 = ∥hM + rS − tM ∥,
sM2 = ∥(hM + hS ) + rS − (tM + tS )∥,
sSM = ∥hS + rS − tM ∥,
sMS = ∥hM + rS − tS ∥
S /M: structure/multi-modal representations
∑∑ h, r h, r h, r h, r
MKBE [124] Feature type specific encoders/decoders; Text attributes, images L= lt log(pt ) + (1 − lt )log(1 − pt ) YAGO-10
DistMult/ConvE; attributes, numeric (h,r) t
h,r
multi-modal KGs modeling; attributes pt = σ s(h, r , t),
h,r
VGG pretrained network on ImageNet lt : a binary label
logp((h, r , t)|θ1 , . . . , θn )a

MMKG [125] Relational reasoning across different entities Numeric attributes, L=− DB15K, YAGO15K,
and images images attributes (h,r ,t)∈T FB15K
LiteralE [126] End-to-end universal extension module Numeric attributes, text sX (h, r , t) → sX (g(h, lh ), r , g(t , lt )) FB15k, FB15k-237,
attributes g(): a gated function; X : specific KGE models YAGO-10
a
θi : the parameters of individual model.

The encoder employs multiple different neural structures according to the different types of multi-modal evidence to embed the multi-modal data used by the link prediction task, while different neural decoders, distinguished by the types of missing multi-modal relational data, use the learned entity embeddings to recover the multi-modal attributes. Experiments demonstrate the effectiveness of MKBE based on both the DistMult and the ConvE scoring functions on two new datasets generated by extending the existing datasets YAGO-10 and MovieLens-100k. This paper shows that a variety of relational data types can provide abundant evidence for the link prediction task, and makes a successful attempt to use multi-modal information in a unified model.

Multi-Modal Knowledge Graphs (MMKG) [125] is a visual-relational resource collection of three KGs for KGC, which is constructed relying on FB15K and is enriched with numeric literals and image information. MMKG extends KBLRN [118] by adding image information to this learning framework.

LiteralE [126] also attaches importance to the rich literal attributes of nodes, especially non-discrete values, and learns entity embeddings by incorporating attribute information via a portable parameterized function. Although LiteralE places emphasis on numerical attributes, it points out that textual or image features can fit the incorporation principle as well for jointly learning literal-enriched embeddings. Additionally, LiteralE explores the effect of utilizing multiple attribute features among relational data, and constructs a new large-scale dataset for multi-modal KGC based on Freebase.

4.1.1.5. Discussion on KGC methods using node's attribute information. Datasets: From Table 12, we touch on several datasets which are rich in attribute data. Liu et al. [125] introduce a collection of Dbpedia15K, YAGO15K, and FB15K that contain both numerical features and image links for all entities in the KGs. The WN9-IMG dataset in [122] contains a subset of WordNet synsets, which are linked according to a pre-defined set of linguistic relations, e.g. hypernym. Based on Freebase, Mousselly-Sergieh et al. [123] develop a novel large-scale dataset, FB-IMG, for multimodal KGC. The FB-IMG dataset better resembles the characteristics of a real KG because it has a much larger number of relations, entities, and triples compared to WN9-IMG (cf. Table 12). Besides, Garcia-Duran et al. [118] create two special datasets referred to as FB15k-num and FB15k-237-num by adding numerical features to the original KGC benchmark FB15K.

Jointly learning: Finally, we discuss the general situation of joint learning in these KGC works using attribute data. We can easily find that node attributes are rarely used in isolation. On the contrary, they tend to be combined and to interact with each other, as mentioned for multi-modal data. The above-mentioned IKRL [122] is a classic multi-modal method that incorporates both visual and structural information. Since node attribute features are a kind of additional diversified information, such KGC works tend to jointly learn the original structure models and the additional attribute models.

As a consequence, they are likely to design a combined scoring system or optimize a joint loss objective. We uniformly summarize their loss functions in Table 11, from which we can easily conclude that they mostly develop their loss objective or energy function in a composite form, usually extending the original definition of triple energy (distance energy, similarity energy, and so on) to take the new multi-modal representations into account.

Table 12
Statistics of several nodes' attributes datasets.
Dataset | Entity | Relation | #Rel KG | #Numeral | #Images
Dbpedia15K | 14,777 | 279 | 99,028 | 46,121 | 12,841
YAGO15K | 15,283 | 32 | 122,886 | 48,405 | 11,194

Dataset | Entity | Relation | #Train | #Valid | #Test
WN9-IMG | 6,555 | 9 | 11,741 | 1,337 | 1,319
FB-IMG | 11,757 | 1,231 | 285,850 | 29,580 | 34,863
FB15k-num | 14,951 | 1,345 | 483,142 | 5,156 | 6,012
FB15k-237-num | 14,541 | 237 | 272,115 | 1,058 | 1,215
4.1.2. Entity-related information

Entity-related information includes entity types and the semantic hierarchical taxonomic information of entities. We uniformly summarize this group of works in Table 13.

In KGs, entity types are side information that commonly exists and dictates whether some entities are legitimate arguments of a given predicate [127]. For instance, suppose the relation of interest is bornin, which denotes the birth location of a person; naturally we expect the candidate entity pairs to be of person-location type in order to hold this relation. What is more, entity type information is readily available and helps to avoid the unnecessary computation caused by incompatible entity-relation combinations.

4.1.2.1. Entity types information. TRESCAL [127] is a conventional tensor decomposition approach; it regards relation extraction (RE) as a KGC task, indicating that entity type information related to the KG can provide additional valuable relational domain knowledge for KGC. The novel paradigm focuses on the relevance between RE and KGC, which enables the learning process to spend less time than other traditional approaches (i.e., TransE and RESCAL).

TCRL [128] considers entity types as hard constraints in latent variable models for KGs. With type information, the type-constraint model selects negative samples according to entity and relation types. However, the type information is not explicitly encoded into the KG representations, and their method does not consider the hierarchical structure of entity types. Moreover, hard constraints may have issues with the noise and incompleteness in type information, which is pretty common in real-world KGs.
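The type-constraint idea can be sketched as a filter on corrupted triples: when a negative sample is drawn, only entities whose type matches the range of the relation are allowed. The following is a simplified illustration of that strategy under assumed data; the toy type table, the relation_signature mapping, and the entity/relation names are assumptions, not the original TCRL implementation.

```python
import random

# Toy entity-type table and relation type signature (domain, range); assumed values.
entity_types = {
    "BenAffleck": "person", "LosAngeles": "location",
    "Seattle": "location", "Oscar": "award",
}
relation_signature = {"bornin": ("person", "location")}

def type_constrained_negatives(triple, n, seed=0):
    """Corrupt the tail of (h, r, t), keeping only candidates whose
    type matches the range of r (the hard type constraint)."""
    h, r, t = triple
    _, range_type = relation_signature[r]
    candidates = [e for e, ty in entity_types.items()
                  if ty == range_type and e != t]
    random.seed(seed)
    return [(h, r, random.choice(candidates)) for _ in range(n)]

print(type_constrained_negatives(("BenAffleck", "bornin", "LosAngeles"), n=2))
```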
TransT [129] combines structure information with type information and takes into account the ambiguity of entities; it dynamically generates multiple semantic vectors according to the context of the entity. Moreover, TransT constructs relation types relying on entity types, and also adds the similarity between related entities and relations as prior knowledge to guide the KG embedding algorithm.

Feature-Rich Networks (FRNs) [130] also leverage entity type information and additional textual evidence for KGC on the FB15k-237 dataset. They learn embeddings for manifold types, along with entities and relations, from noisy resources. Their way of incorporating type information makes a (small) contribution towards improving performance in predicting unseen facts.

Ontology-Based Deep Learning Approach (OBDL) [131] recently adds ontological information (where ontological information refers to the type hierarchy features of a given entity, which are shared among similar entities) into KG embedding in a deep learning framework, which enables it to predict unseen facts in the training process (referred to as fresh entities).

4.1.2.2. Entity hierarchy taxonomic information. The entity hierarchy taxonomic information is a hierarchy of entity categories. Categories at different levels reflect similarity at different granularities. Each vertex is assigned a path (from the root to a leaf) in the hierarchy [129]. The neighborhood structure of a vertex is usually closely related to an underlying hierarchical taxonomy: the vertices are associated with successively broader categories that can be organized hierarchically [134]. The hierarchical taxonomy of entities allows information to flow between vertices via their common categories, so that it provides an effective mechanism for alleviating data scarcity.

Entity Hierarchy Embedding (EHE) [132] learns distributed representations for the entity hierarchy by designing a distance matrix for each entity node. The aggregated metrics encode entity hierarchical information to obtain hierarchy embeddings, which can significantly capture abundant semantics for KGC.

Semantically Smooth Embedding (SSE) [133] takes advantage of additional semantic information, e.g., entity semantic categories, and constrains the geometric structure of the embedding space to be consistent with the observed facts. The embeddings are smoothed semantically under a smoothness assumption that leverages two different learning algorithms, Laplacian Eigenmaps [137] and Locally Linear Embedding [138]. On the one hand, the proposed smoothness assumption is portable and well-adapted to a wide variety of KG embedding models. On the other hand, SSE regularization terms can be constructed from other useful additional features in other possible embedding tasks.

NetHiex [134] is a network embedding algorithm that incorporates a hierarchical taxonomy into network embeddings, thus modeling hierarchical-taxonomy-aware entity embeddings. NetHiex uses a nonparametric probabilistic framework to search for the most plausible hierarchical taxonomy according to the nested Chinese restaurant process, and then recovers the network structure from the network embeddings according to the Bernoulli distribution. This framework is implemented by an efficient EM algorithm with linear time complexity per iteration, which makes NetHiex a scalable model. Besides, NetHiex learns an entity representation consisting of multiple components that are associated with the entity's categories of diverse granularity, which effectively alleviates data scarcity.

Guided Tensor Factorization Model (GTF) [135] pays attention to the more challenging completion of generics KGs. It applies a knowledge-guided TF method considering the taxonomy hierarchy of entities and the corresponding relation schema, appending guided quantification constraints and schema consistency on triple facts.

SimplE+ [136] also concentrates on background taxonomic information about knowledge facts. [136] points out that the existing fully expressive TF models are less expressive in utilizing taxonomic features, which are very instructive for guiding LP. Considering the taxonomic information in the form of subclass and sub-property relations, SimplE+ advances SimplE [44] by adding non-negativity constraints to further inject subsumption content into the original LP method, which is a simple but effective attempt for KGC.

4.1.3. Relation-related information

The majority of facts in KGs possess comprehensive semantic relations, which often include transitivity and symmetry properties, as well as the type hierarchical characteristic. Take the

Table 13
Summarization of introduced KGC methods using Entity-related information.
Model Technology Entity information Dataset
KGC using entity types information:
TRESCAL [127] a. base on RESCAL; Entity type information; textual data NELL
b. low computational complexity;
c. entity-type constraints
TCRL [128] a. entity-type constraint model; Entity type information Dbpedia-Music,
b. under closed-world assumption FB-150k,YAGOc-195k
TransT [129] a. dynamical multiple semantic vectors; Structured information; entity type information FB15k, WN18
b. entities-relations similarity as prior knowledge
FRNs [130] a. jointly modeling KGs and aligned text; Entity type information; additional textual FB15k-237
b. a composition and scoring function parameterized by evidence
a MLP
OBDL [131] a. deep learning framework (NTN); Entity type hierarchy feature; ontological WordNet, Freebase
b. a new initialization method for KGE; information
c. unseen entity prediction
KGC using entity hierarchy taxonomic information:
EHE [132] a. distance matrix; Entity hierarchy information Wikipedia snapshot
b. entity similarity measuring
SSE [133] Portable smoothness assumption: Entity semantic categories NELL_L, NELL_S, NELL_N
a. Laplacian Eigenmaps 186
b. Locally Linear Embedding.
NetHiex [134] a. a nonparametric probabilistic framework Hierarchical taxonomy information BlogCatalog, PPI, Cora,
b. nested Chinese restaurant process Citeseer
c. EM algorithm
GTF [135] a. knowledge guided tensor factorization method; Entity taxonomy hierarchy; corresponding Animals, Science
b. guided quantification constraints; relation schema
c. imposing schema consistency
SimplE+ [136] SimplE with non-negativity constraints Subclass and subproperty taxonomic WN19, FB15K, Sport,
information of entity Location

Table 14
Characteristics of introduced KGC methods using relation-related information.
Model Technologies Relation-related information Datasetsa
TranSparse [139] Complex relation-related transformation matrix Heterogeneous and imbalance characteristics LP: FB15k, WN18,
of relations FB15k-237, WN18RR
AEM [140] Relation weight Asymmetrical and imbalance characteristics of LP: WN18, FB15K;
relations TC: WN11, FB13, FB15K
Trans-HRS [112] TransE/TransH/DistMult + HRS structure Three-layer HRS structure information of LP: FB15K, WN18
relations
On2Vec [141] a. Component-specific Model encoder Hierarchical relations RP: DB3.6K,CN30K,
b. Hierarchy Model YG15K,YG60K
JOINTAe [142] a. autoencoder Compositional information of relations LP: WN18, FB15k,
b. considers relation inverse characteristic WN18RR, FB15k-237
c. based on RESAC
d. relations composition in [143]
Riemannian- a. multi-relational graph embedding Multi-relational (hypernym and synonym) TP: WN11, FB13
TransE b. Non-Euclidean Space modeling information of relations
[144] c. based on TransE
d. non-Euclidean manifold
TRE [145] Relation inference based on the triangle pattern of Entity-independent transitive relation patterns LP: FB15K, WN18,
knowledge base RP: FB15K, WN18, DBP
a
‘LP’, ‘RP’ and ‘TC’ respectively refer to Link Prediction task, Relation Prediction task and Triple Classification task.

transitivity relation pattern as an example in Fig. 16, three entities


a, b, c are connected through relations r1 , r2 , r3 . If these three
relations, no matter connected with which entity, often appear
together, then we can treat that as a transitivity relation pattern,
this pattern can be applied to an incomplete triangle to predict
the missing relation between entities d and f . Here we set out the
applications of relation-related information among KGC methods.
Table 14 gives a systematical summary for the KGC studies
using relation-related features.
Fig. 16. An example of transitivity relation pattern excerpting from [145].
4.1.3.1. Methods. TranSparse [139] Since relations in KGs are heterogeneous and imbalanced, TranSparse is proposed to address this issue by introducing a complex relation-related transformation

matrix [99]. TranSparse believes that the transformation matrix


should reflect the heterogeneity and unbalance of entity pair,
it changes the transformation matrix into two adaptive sparse
transfer matrices corresponding to the head entities and tail
entities.
Asymmetrical Embedding Model (AEM) [140] pays attention
to the asymmetrical and imbalanced characteristics of relations
and conducts supplement research for KGC. AEM weights each
entity vector by corresponding relation vectors according to the
role of this entity in a triple. Significantly, AEM weights each
dimension of the entity vectors, whose impact is similar to TransA
[108], can accurately represent the latent properties of entities
and relations.
Trans-HRS [112] learns knowledge representations by exploit-
ing the three-layer HRS relation structure information as an ex-
tension of existing KG embedding models TransE, TransH, and
DistMult.
On2Vec [141] is a translation-based model for dealing with spe-
cialized semantic relation facts in ontology graphs, technically Fig. 17. A neighborhood subgraph example of a KG [147].
models comprehensive relations in terms of various relation
properties, such as transitivity, symmetry, and hierarchical.
On2Vec consists of two sub-structures, one of them is a
Component-Specific Model which is charges for preserving relation
properties, the another named the Hierarchy Model aims to handle
hierarchy relations specifically. On2Vec is an effective ontology
relation prediction model which can nicely operate ontology
population by exploiting those properties or sub-properties of
semantic relations properly.
JOINTAe [142] explores a dimension reduction technique jointly
training with an auto-encoder, to better learn low dimension
interpretable relations, especially for compositional constraints. Fig. 18. A general process using neighbor information for KGC [149]. The
subgraphs G in the dashed boxes is the neighborhood graph of triple (h, r , t), and
As for the compositional constraints on relations, JOINTAe adapts
triples in G are represented by a solid edge, and triples (e.g., candidate triples)
mentioned approach in [143]. Moreover, JOINTAe considers in- not in G are represented by a dashed edge. Note that any ‘‘head prediction’’
verse relations in the training procedure and amends the score problem (?, r , t) can be converted to the ‘‘tail prediction’’ problem (t , r − , ?).
function based on RESACL.
Riemannian TransE Recently, Multi-Relation Embedding is a pop-
ular hot-spot to KGC. At this basis, Riemannian TransE [144] 4.1.4. Neighborhood information
exploits a non-Euclidean manifold in a Non-Euclidean Space to The neighbors of entity are new kinds of additional informa-
operate multi-relational graph embedding. It allots particular dis- tion containing both semantic and topological features, which
similarity criteria to each relation according to the distance in could be exploited for KGC. For instance, consider a KG fragment
Non-Euclidean space, replaces parallel vector fields in TransE with example given in Fig. 17 [147]. If we know that BenAffleck has
vector fields with an attractive point to get better embedding won an Oscaraw ard and BenAffleck lives in LosAngeles, we pre-
results, and inherits TransE’s characteristic of low complexity fer to predict that BenAffleck is an actor or a filmmaker, rather
parameter at the same time. than a teacher or a doctor. Further, if we additionally know that
Ben Affleck’s gender is male then there is a higher probability
TRE [145] is invented for completing sparse KGs, which effec-
for him to be a filmmaker. Mostly, the neighbors are utilized
tively leverages entity-independent transitive relation patterns
to form a relation-specific mixture representation as an entity
to find the patterns for infrequent entities. Though TRE briefly
vector to assist in entity learning, the general thought is shown
learns representations of relations instead of entity representa-
in Fig. 18. Although the well-known Graph Convolution Networks
tion learning as previous KGC methods, it gets high effectiveness
(GCNs) [70,82] and Graph Attention Networks (GATNs) [148] also
in predicting missing facts with low computational expensive but
learn neighborhood-based representations of nodes, they suf-
high interpretability.
fered from expensive computation and did not learn sub-optimal
4.1.3.2. Discussion on relation-related information for KGC. Why query-dependent compositions of the neighborhood. We make a
are the relation characteristics evidence becoming popular in the presentation for entity neighbor information aware KGC methods
KGC field? Firstly, the relation patterns are independent of enti- except for GCNs or GATNs. Table 15 exhibits general KGC methods
ties, so that it can predict missing relations of uncommon entities, using neighbor information.
which is helpful to alleviate the sparsity problem by improving
the completion of infrequent entities through frequent relation 4.1.4.1. Aggregating neighbors with attention mechanism. A2N
patterns [145], the conventional embedding method is hard to [150] Opposed to the early method NMM [147] (where NMM
achieve it. Secondly, compared with the embedding methods, the incorporates TransE with neighbors information to crystallize into
computational cost of identifying relation patterns is lower [146], a TransE-MRR version but only learns a fixed mixture over neigh-
because it does not need to learn the embedded representation of bors), A2N embeds query-dependent entities with corresponding
individual entities. Last but not least, relation patterns are highly neighbors into the same space via bi-linear attention on the graph
interpretable. neighborhood of an entity, to generate neighborhood informed

Table 15
Characteristics of introduced KGC methods using neighbor information.
Model Technology Additional information Datasets
Aggregating neighbors with attention mechanism:
A2N [150] DistMult + attention scoring Neighbor structure information FB15K-237, WN18RR
LENA [149] Windowed Attentions; Neighbor structure information FB15K, FB15K-237, WN18, WN18RR
Cross-Window Pooling
LAN [151] Logic Attention Network; Relation-level information; Subject-10 and Object-10 in FB15K
end-to-end model: neighbor-level information
Encoder: LAN, Decoder: TransE
G2SKGEatt [152] Graph2Seq network; Neighbor structure information FB15K, FB15K-237, WN18, WN18RR
attention mechanism;
end-to-end model:
Encoder: Graph2Seq, Decoder: ConvE
KBAT [91] Generalized GAT; Entity’s multi-hop neighborhood FB15K-237, WN18RR, NELL-995, Kinship
end-to-end model:
Encoder: KBAT, Decoder: ConvKB
RGHAT [153] GNN; Entity’s multi-hop neighborhood FB15K, WN18, FB15K-237, WN18RR
hierarchical attention mechanism;
end-to-end model:
Encoder: RGHAT, Decoder: ConvE
Other technologies for KGC using neighbor information:
GMatching [154] Permutation-invariant network; Neighbor structure information NELL-One, Wiki-One
LSTM;
end-to-end model:
Encoder: neighbor encoder,
Decoder: matching processor
GMUC [155] Gaussian metric learning; Neighbor structure information NL27K-N0, NL27K-N1, NL27K-N2 and
few-shot UKGC; NL27K-N3
end-to-end model:
Encoder: Gaussian neighbor encoder,
Decoder: LSTM-based matching networks
NKGE [31] Dynamic Memory Network; Structure representation; FB15K, FB15K-237, WN18, WN18RR
gating mechanism; neighbor representation
end-to-end model:
Encoder: DMN, Decoder: TransE/ConvE
CACL [93] Contextual information collection; Multi-hop neighborhoods structure information FB13, FB15K, FB15K-237, WN18RR
context-aware convolutional
OTE [115] RotatE; Graph contexts representations FB15K-237, WN18RR
orthogonal transforms
CNNIM [156] Concepts of Nearest Neighbors; Neighbors information FB15k-237, JF17k, Mondial
Dempster–Shafer theory
CAFE [157] Neighborhood-aware feature set; Neighborhood-aware features FB13-A-10, WN11-AR-10, WN18-AR-10,
feature grouping technique NELL-AR-10

representation. For the attention scoring, A2N uses the DistMult neighbors. LAN meets all three significant properties: Permutation
function to project the neighbors in the same space as the target Invariant, Redundancy Aware and Query Relation Aware.
entities.
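To illustrate the query-dependent neighborhood attention idea described above in a generic way, the sketch below scores each (relation, neighbor-entity) pair against the query relation with a bilinear form and returns the attention-weighted neighbor mixture. This is only an A2N-style illustration under assumed toy dimensions and a made-up bilinear scorer, not the authors' code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend_neighbors(query_rel, neighbor_rels, neighbor_ents, W):
    """Score each (relation, neighbor-entity) pair against the query relation
    with a bilinear form, then return the attention-weighted neighbor mixture."""
    projected = neighbor_rels + neighbor_ents          # simple neighbor encoding
    scores = projected @ W @ query_rel                 # bilinear attention scores
    alpha = softmax(scores)                            # attention weights
    return alpha @ projected                           # aggregated, query-aware entity vector

rng = np.random.default_rng(4)
d, n_neighbors = 8, 5
W = rng.normal(size=(d, d))
query_rel = rng.normal(size=d)
neighbor_rels = rng.normal(size=(n_neighbors, d))
neighbor_ents = rng.normal(size=(n_neighbors, d))
print(attend_neighbors(query_rel, neighbor_rels, neighbor_ents, W).shape)  # (8,)
```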
G2SKGEatt [152] develops a information fusion mechanism
Inspired by the thought of aggregating neighbors with atten-
Graph2Seq to learn embeddings that fuses sub-graph structure
tion mechanism in [150], there has generated a lot of closely
information of entities in KG. To make fusion more meaningful,
relevant studies:
G2SKGEatt formulates an attention mechanism for fusion. The 1-
Termed locality-expanded neural embedding with attention N scoring strategy proposed by ConvE [33] is used to speed up
(LENA) [149] is introduced to filter out irrelevant messages among the training and evaluation process.
neighborhoods with the support of an attentional setting. This
KBAT [91] is also an attention-based KGE model which captures
work indicates that the KG embedding relying on even sufficient
both entity and relation features in the multi-hop neighborhood
structure information is deficient since the graph data tend to
of given entity. KBAT uses ConvKB [68] as its decoder module
be heterogeneous. Therefore, LENA emphasizes that information
and specifically caters to the relation prediction (RP) task. RGHAT
involved in the graph neighborhood of an entity plays a great
[153] designs a novel hierarchical attention mechanism to com-
role in KG embedding in especially with complex heterogeneous
pute different weights for different neighboring relations and
graphs.
entities. Consider that the importance of different relations differ
Logic Attention Network (LAN) [151] is a novel KG-specific greatly in indicating an entity and to highlight the importance
neighborhood aggregator that equips attention mechanism to of different neighboring entities under the same relation, the
aggregate neighbors in a weighted combination manner. This hierarchical attention mechanism including two-level attention
work designs two mechanisms for modeling relation-level and mechanisms: a relation-level attention and an entity-level at-
neighbor-level information respective from coarse to fine: Logic tention. The relation-level attention firstly indicate an entity by
Rule Mechanism and Neural Network Mechanism, in the end, a computing the weights for different neighboring relations of it,
double-view attention is employed to incorporate these two then the entity-level attention computes the attention scores for
weighting mechanisms together in measuring the importance of different neighboring entities under each relation. Finally, each

entity aggregates information and gets updated from its neigh- instance-based learning. The application of graph pattern instead
borhood based on the hierarchical attentions. RGHAT can utilize of numerical distances makes the proposed method interpretable.
the neighborhood information of an entity more effectively with
CAFE [157] completes KGs using the sets of neighborhood-aware
the use of hierarchical attention mechanism. features to evaluate whether a candidate triple could be added
4.1.4.2. Other technologies for KGC using neighborhood information. into KGs. The proposed set of features helps to transform triples in
Some other works concern different technologies to make use of the KG into feature vectors which are further labeled and grouped
the neighborhood information. for training neural prediction models for each relation. These
models help to discern between correct triples that should be
GMatching [154] takes those one-shot relations which usually added to the KG, and incorrect ones that should be disregarded.
contain valuable information and make up a large proportion of Note that since CAFE exploits the highly connected nature of KGs
KGs into consideration, and introduces an intelligent solution to rather than requiring pre-processing of the KG, it is especially
the problem of KG sparsity caused by long-tail relations. GMatch- suitable for ever-growing KGs and dense KGs.
ing learns knowledge from one-shot relations to solve the sparsity
4.1.4.3. Discussion on KGC models using neighborhood information.
issue and further avoid retraining the embedding models when
From the above introduction and comparison about neighbor-
new relations are added into existing KGs. This model consists of
used KGC literature, we further make a basic discussion and
two components: a neighbor encoder and a matching processor,
analysis as follows:
which are responsible for encoding the local graph structure to
(1) To better obtain the neighborhood graph information, we
represent entities and calculating the similarity of two entity
need to select an appropriate fusion strategy to collect useful
pairs respectively. surrounding neighbor contexts.
GMUC [155] is a Gaussian metric learning-based method that (2) we find that most models tend to use the encoder-to-decoder
aims to complete few-shot uncertain knowledge graphs (UKGs, (end-to-end) architecture when learning neighbor information for
such as NELL and Probase, which model the uncertainty as confi- KGC, in other words, the neighbor learning part is portable which
dence scores related to facts). As the first work to study the few- could be applied to various KGE models such as translation mod-
shot uncertain knowledge graph completion (UKGC) problem, els (e.g., TransE [11], TransH [15], TransR [12]) and Bilinear mod-
GMUC uses a Gaussian neighbor encoder to learn the Gaussian- els [42,161]. We give a presentation about these end-to-end
based representation of relations and entities. Then a Gaussian structures in Table 15, and show them in Fig. 19 to illustrate this
matching function conducted by the LSTM-based matching net- intuition.
works is applied to calculate the similarity metric. The matching (3) The embedding parameters for every entity-relation pair may
be prohibitively large when the learned neighbor fusion is fixed,
similarity can be further used to predict missing facts and their
which led to the adaptable mixture methods based on the differ-
confidence scores. GMUC can effectively capture uncertain se-
ent query are more and more popular over recent years.
mantic information by employing the Gaussian-based encoder
and the metric matching function.
4.1.5. Relational path information
NKGE [31] uses a End-to-End Memory Networks (MemN2N) based In KGs, there are substantial multiple-step relation paths be-
Dynamic Memory Network (DMN) encoder [158] to extract infor- tween entities indicating their semantic relations, these relation
mation from entity neighbors, and a gating mechanism is utilized paths reflect complicated inference patterns among relations in
to integrate the structure representations and neighbor represen- KGs [12], it helps to promote the rise of the path-based relation
tations. Based on TransE [11] and ConvE [33] respectively, NKGE inference, one of the most important approaches to KGC task
designs two kinds of architectures to combine structure represen- [168]. We generally list these path-based KGC works in Table 16.
tation and neighbor representation. Experimental results show multi-hop KGC (mh-KGC): We refer to the definition in [163],
that the TransE-based model outperforms many existing trans- the mh-KGC aims at performing KGC based on existing relation
lation methods, and the ConvE-based model gets state-of-the-art paths. For example in Fig. 20, for the relation path Microsoft →
metrics on most experimental datasets. IsBasedIn → Seattle → IsLocatedIn → Washington →
IsLocatedIn → United States (as the blue lines shows), the task
Context-aware convolutional learning (CACL) [93] is a study
is to predict whether (or what) there exists direct relations that
of exploring the connection modes between entities using their
connects h and t; i.e., (Microsoft , CountryOfHQ , United States) in
neighbor contexts information, which facilitates the learning of
this case. This kind of reasoning lets us infer new or missing
entity and relation embeddings via convoluting deep learning
facts from KGs. Sometimes there can exist multiple long paths
techniques directly using the connection modes contained in each
between two entities, thus in this scene, the target relation may
multi-hop neighborhood. be inferrable from not only one path.
Orthogonal transform embedding (OTE) [115] advances RotatE
[110] in two ways: (1) leveraging orthogonal transforms [116] 4.1.5.1. Multi-hop KGC using atomic-path features. Path Ranking
to extend RotatE from 2D complex domain to high dimension Algorithm (PRA) [164] is the first work that emerges as a promis-
space in order to raise modeling ability, and (2) OTE takes account ing method for learning inference paths in large KGs, it uses
of the neighbor contexts information, effectively learns entity random walks to generate relation paths between given entity
embeddings by fusing relative graph contexts representations. pairs by depth-first search processes. The obtained paths then
Experiments contrast that with RotatE, R-GCN and A2N revealing are further encoded as relational features and combined with a
great availability of OTE. logistic regression model to learn a binary log-linear classifier to
decide whether the given query relation exists between the entity
Concepts of Nearest Neighbors-based Inference Model (CN- pairs.
NIM) [156] performs LP recognizing similar entities among com- However, millions of distinct paths in a single classifier are
mon graph patterns by the use of Concepts of Nearest Neighbors generated by the PRA method, it may supervene with feature
[159], from where Dempster–Shafer theory [160] is adapted to explosion problem because each path is treated as an atomic
draw inferences. CNNIM only spends time in the inference step feature, which makes the atomic-path idea difficult to be adopted
because it abolishes training time-wasting to keep a form of by KGs with increasing relation types [166]. Additionally, since
31
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

Fig. 19. Several end-to-end KGC models using neighborhood information.


Source: These figures are extracted from [31,152,154,162].

Fig. 21. A sketch map to path feature explosion [166].

of relations to alleviate the feature explosion problem [166]: In


the work of [177], many paths are folded by clustering the paths
Fig. 20. An illustration of knowledge reasoning over paths [163]. according to the embedding degree of the relation between paths,
then it uses cluster ID to replace the original relation type. The
work [178] maps unseen paths to nearby paths seen at training
PRA must compute random walk probabilities associated with time, where the nearness is measured using the embeddings.
each path type and entity pair, resulting in proportional compu- The work [165] defines a simpler feature matrix generation al-
tation amount increase with the path number and path length. gorithm called subgraph feature extraction (SFE), it conducts a
The feature explosion issue is shown in Fig. 21. more exhaustive search, a breadth-first search instead of random
Therefore, new versions of PRA [165,177,178] try to develop walks, to characterize the local graph. Without the random walk
a series of more efficient and more expressive models related to probabilities computation, SFE can extract much more expressive
PRA. Both the first two use pre-trained vector representations features, including features that are not representable as paths in
32
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

Table 16
Characteristics of introduced KGC methods using relational path information.
Model Technology Additional information Path selection strategy Datasets
Mh-KGC using atomic-path features:
PRA [164] Random walks Atomic path feature A single path NELL
SFE [165] Breadth-first search Atomic path feature A single path NELL
Non-atomic multi-hop reasoning:
PATH-RNN RNN + PRA, zero-shot reasoning Non-atomic and compositional path feature, Max pooling Freebase +
[166] arbitrary-length path ClueWeb
Trans-COMP Compositional training, path Non-atomic path feature A single path WordNet, Freebase
[143] compositional regularizer
Path-augmented translation models:
PTransE [12] PCRAa + TransE + path scoring Non-atomic path feature PCRA FB15K
RTransE [39] TransE + regularization Non-atomic path feature Focused on ‘‘unambiguous’’ paths: FB15K, FAMILY
composition ℓ1 : 1-to-1 or 1-to-many relations,
ℓ2 : 1-to-1 or many-to-1 relations
PTransD [117] Path-augmented TransD Path PCRA FB15K
Modeling paths using neural networks:
Single-Model Path-RNN; Shared Parameter Path, intermediate nodes, entity-types Scoring pooling: Top-K, Average and Freebase +
[167] Architecture LogSumExp ClueWeb
APCM [168] RNN + Attention Path, entity type Attentive Path Combination FC17
IRNs [169] Shared memory + controller Path, structured relation information Controller determines the length of paths WN18, FB15K
ROHP [163] Three ROPs architectures: GRUs Path Arbitrary-length path Freebase +
ClueWeb
PRCTA [170] RNN; constrained type attention; Path, entity and relation types Path-level attention Freebase +
relation-specific type constraints ClueWeb
mh-RGAN [96] RNN reasoning models + GAN Non-atomic path feature Generator G of GAN WordNet, FreeBase
Combine path information with type information:
All-Paths [171] Dynamic programming, considers Path, relation types Dynamic programming NCI-PID and
intermediate nodes WordNet.
RPE [172] Relation-specific type constraints; Path, relation type Reliable relation paths-selection strategy LP: FB15K; TC:
path-specific type constraints FB15K, FB13,
WN11
APM [173] Abstract graph + path Abstract paths, strongly typed relations Paths in abstract graph Freebase, NELL
Leveraging order information in paths:
OPTransE [174] TransE + Ordered Relation Paths Path, relation orders Path fusion: two layer pooling strategy WN18 and FB15K
PRANN [175] CNN + BiLSTM Path + entities/relations orders, entity Path-level Attention NELL995,
types FB15k-237,
Countries, Kinship
a
‘PCRA’: path-constraint resource allocation algorithm [176].

the graph at all — but the core mechanism of these three works various traditional KGC models to answer path queries. This tech-
continues to be a classifier based on atomic-path features. Be- nique is applicable to a broad class of combinable models that
sides, neither one can perform zero-shot learning because there include the bilinear model [13] and TransE [11], i.e, the score
must be a classifier for each predicted relation type in their function:
approaches.
s(s/r , t) = M(Tr (xs ), xt )
4.1.5.2. Non-atomic multi-hop reasoning. Some works explore to
represents a combinatorial form where the traversal operator
utilize path information as non-atomic features during a KGC
procedure.
Tr (xs ) means a path query (xs , r , ?) following Tr : Rd → Rd , and
operator M illustrates the incorporable model’s score operation
PATH-RNN [166] can not only jointly reason on the path, but follows M : Rd × Rd → R, for example, when cooperates with
also deduce into the vector embedded space to reason on the TransE, the traversal operator becomes to Tr (xs ) = xs + wr and
elements of paths in a non-atomic and combinatorial manner. the score function then turns into:
Using recursive neural networks (RNNs) [179] to recursively ap-
ply a composite function to describe the semantics of latent s(s/r , t) = M(Tr (xs ), xs ) = −∥Tr (xs ) − xs ∥22
relations over arbitrary length paths (in Fig. 22(a)), PATH-RNN so that it can handle a path query q = s/r1 /r2 /.../rk by:
finally produces a homologous path-vector after browsing a path.
PATH-RNN can infer from the paths not seen in the training s(q, t) = −∥xs + wr1 + · · · + wrk − xt ∥22
during the testing process, and can also deduce the relations that
The compositional training is regarded as providing a new form
do not exist in the KGs.
of structural regularization for existing models since it substan-
TransE-COMP [143] suggests a new compositional training ob- tially reduces cascading errors presented in the base vector space
jective that dramatically improves the path modeling ability of model.
33
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

Fig. 22. Several RNN-structure KGC models using path information.


Source: These figures are from [96,163,166–168,170,175].

4.1.5.3. Path-augmented translation models. The path-augmented which the latter item E(h, P , t) models the inference correla-
translation methods, which introduce multi-step path informa- tions between relations with multi-step relation path triples. In
tion into classical translation models, are developed. PTranasE, relation paths p ∈ P(h, t) are represented via se-
mantic composition of relation embeddings, by perform Addition,
PTransE [12] uses path information in its energy function as: Multiplication or RNN operation:

s(h, r , t) = E(h, r , t) + E(h, P , t) Addition : p = r1 + · · · + rl


34
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

Multiplication : p = r1 · ... · rl

RNN : c1 = r1 , . . . , p = cn
Simply put, PTransE doubles the number of edges in the KG by
creating reverse relations for each existing relation. Then PTransE
uses a path-constraint resource allocation algorithm (PCRA) [176] to
select reliable input paths within a given length constraint.
RTransE [39] learns compositions of relations as sequences of
translations in TransE by simply reasoning among paths, in this
process, RTransE only considers a restricted set of paths of length
two. This paper augments the training set with relevant exam-
ples of the above-mentioned compositions, and training so that Fig. 23. Example of the meaning change when the order of relations is altered.
sequences of translations lead to the desired result.
PTransD [117] is a path-augmented TransD, it thinks relation
paths as translation between entities for KGC. Similar to TransD, Final path encoding leverages path-level attention to combine
PTransD considers entities and relations into different semantics useful paths and produces path representations.
spaces. PTransD uses two vectors to represent each entity and We collect some representative structures of methods that
relations, where one of them represents the meaning of a(n) en- model path information for KGC using RNNs in Fig. 22. There
tity (relation), and another one is used to construct the dynamic are other path-based KGC models using other neural network
mapping matrix. frameworks:

4.1.5.4. Modeling paths using neural networks. We can see that Multi-hop Relation GAN (mh-RGAN) [96] considers multi-hop
neural network is handy in modeling path, especially the RNN (mh) reasoning over KGs with a generative adversarial network
application lines: (GAN) instead of training RNN reasoning models. The mh-RGAN
consists of two antagonistic components: a generator G with
Single-Model [167] Based on the PATH-RNN [166], Single-Model respect to composing a mh-RP, and a discriminator D tasked with
discusses path-based complex reasoning methods extended by
distinguishing real paths from the fake paths.
RNN and jointly reasoning with within-path relations, entities,
and entity types in the paths. 4.1.5.5. Combining path information with type information. Some
methods consider type information of entities and relations when
Attentive Path Combination Model (APCM) [168] first generates
modeling path representations, such as [167,168], and [170].
path representations using an RNN architecture, then it assigns
discriminative weights to each path representations to form the Relational Path Embedding model (RPE) Lin et al. [172] extend
representation of entity pair, finally, a dot-product operation the relation specific type constraint to the new path specific type
between the entity pair representation and the query relation constraint, both two type constraints can be seamlessly incor-
representation is designed to compute the score of a candidate porated into RPE to improve the prediction quality. In addition,
query relation, so that it allows entity pair to get representation RPE takes full advantage of the semantics of the relation path to
with respect to query relations in a dynamic manner. explicitly model KGs. Using the composite path projection, RPE
Implicitly ReasonNets (IRNs) [169] designs a network archi- can embed each entity into the proposed path space to better
tecture, which performs multi-hop reasoning in vector space handle the relations with multiple mapping characteristics.
based on shared memory. The key highlight is the employment Abstract Path Model (APM) [173] focuses on the generation of
of shared memory that intelligently saves relevant large-scale abstract graph depending on the strongly typed relations and
structured relations information in an implicit manner, thus it can then develops a traversal algorithm for mining abstract paths
avoid explicit human-designed inference. IRNs reasons according in the produced intensional graph. Those abstract paths tend
to a controller to stipulate the inference step during the whole to contain more potential patterns to execute KG tasks such as
inference procedure simultaneously gets proper interaction with LP. The proposed abstract graph drastically reduces the original
shared memory. This work performs an excellent function on KGC graph size, making it becomes more tractable to process various
about complex relations. graphs.
Recurrent one-hop predictor Model (ROHP) [163] explores 4.1.5.6. Leveraging order information in paths. The order of re-
three ROHP architectures with the capability of modeling KG lations and entities in paths is also important for reasoning.
paths of arbitrary lengths by using recurrent neural networks As Fig. 23 shows, the meaning will change when the order of
(GRUs [180]) to predict entities in the path step by step for relations is altered [174].
multi-hop KG reasoning.
OPTransE [174] attaches importance to the order information
Path-based Reasoning with Constrained Type Attention
of relations in relation paths via projecting each relation’s head
(PRCTA) equipped with a constrained type attention mechanism
entity and tail entity into different vector spaces respectively.
for multi-hop path reasoning [170]. On the one hand, PRCTA
To capture the complex and nonlinear features hidden in the
encodes type words of both entities and relations to extract
paths, OPTransE designs a multi-flow of min-pooling layers. It
abundant semantic information by which partly improves the
was experimentally validated that OPTransE performs well in LP
sparsity issue, on the other hand, for reducing the impact of
task, directly mirroring the vital role of relation order information
noisy entity types, constrained type attention is designed to
in relation paths for KGC.
softly select contributing entity types among all the types of a
certain entity in various scenarios, meanwhile, relation-specific Path-based Reasoning with Attention-aware Neural Network
type constraints are made full use for enhancing entity encoding. (PRANN) [175] also uses the ordering of the local features to learn
35
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

Table 17
Statistics of path-based KGC datasets FC (Freebase + ClueWeb) and FC17.
Datasets FC FC17
Entities 18M 3.31M
Freebase triples 40M 35M
ClueWeb triples 12M 104M
Relations 25,994 23 612
Relation types tested 46 46
Avg. paths/relation 2.3M 3.19M
Avg. training positive/query relation – 6621
Avg. training negative/query relation – 6622
Avg. training facts/relation 6638 –
Avg. positive test instances/relation 3492 3516
Avg. negative test instances/relation 43,160 43 777
Fig. 24. An illustration that relations between an entity pair can be inferred by
considering information available in multiple paths collectively [168].

path combination operations works at a score-level, and has its


deficiency:
about the entities and the relation orderings of each path. PRANN (1) Max: Only one path is used for reasoning, while all other
explores a novel path encoding framework including a bidirec- information paths are ignored.
tional long short-term memory (BiLSTM) followed by a CNN and (2) Average: As is often the case that the path sets connecting
fusion paths works by a path-level attention mechanism. The an entity pair are very large, and only a few paths may be helpful
structure of PRANN can result in an efficient path representation for reasoning. Therefore, the model is often affected by noise.
which leads to excellent LP results. Considering multi-step rea- (3) Top-K : Different entity pairs may have different optimal
soning on paths, this paper further designs a memory module for K values. Moreover, not all Top-k paths contribute equally to
storing the time-distributed encoded path, which can repeatedly reasoning.
extract path features to enhance the prediction performance. The (4) LogSumExp: This is a smooth approximation of the ‘‘Max’’
author indicates that it is necessary to develop external memory operator, which can be seen as ‘soft’ attention, but cannot effec-
storage for storing overall paths between every entity pairs to tively integrate evidence from multiple paths.
meet the increased needs of entity pairs in current KGs. The unsatisfactory path combination situation promotes a se-
4.1.5.7. Path selection strategy. As we mentioned above, the rea- ries of effective approaches that begin to spring up. For an en-
soning path set of an entity pair usually contains more than one tity pair and the set of relation paths between them, Attentive
path, so that when we conduct KGC, we should ponder over how Path Combination Model (APCM) [168] assigns discriminative
we make use of those multiple paths, should we choose one path weights to each path to further combine these weighted path rep-
or consider all the paths? And if we choose only one path, what resentations into an entity pair level representation, Path-based
selection strategies do we need to follow? Thus it is a noteworthy Reasoning with Constrained Type Attention (PRCTA) [170] uses
issue that how to formulate an appropriate method of finding the a constrained type attention mechanism for multi-hop path rea-
most informative path under the mh-KGC task. For an example in soning, which mainly considers to alleviate the negative influence
Fig. 24, none of the four paths directly contains evidence that the of graph sparsity and entity type noise when conducting the
nationality of Steve Jobs is U.S., but when we jointly consider these reasoning procedure.
paths together, we will get much more information to support the
fact (Stev e Jobs, nationality, U .S.). 4.1.5.8. Performance analysis about path-based KGC. Datasets: We
introduce a famous dataset, Freebase + ClueWeb (called FC for
Trans-COMP [143] models only a single path between an entity
convenience) [166], for path reasoning over KGs. FC is a large-
pair, moreover, PATH-RNN [166] uses Max operator to select
scale dataset of over 52 million triples, it involves preprocessing
the path with the largest predictability at each training/testing
for multi-hop KGC (mh-KGC). The dataset is built from the com-
iteration [168]. The previous KGC methods [12,143] using relation
bination of Freebase [5] and Google’s entity linking in ClueWeb
paths neither take account of intermediate nodes nor model
[181], which contains entities and relations from Freebase and
all the relation paths since the computational expense is too
is enriched with ClueWeb text. FC is widely applied by several
expensive to enumerate all possible paths, especially in graphs
path-based KGC methods [163,166,167,170], rather than Gard-
containing text [171]. Whereby, All-Paths [171] improves upon
ner’s 1000 distinct paths per relation type, it have over 2 mil-
them by additionally modeling the intermediate entities in the
lion [166]. FC can be downloaded from http://iesl.cs.umass.edu/
path and modeling multiple paths. For a given path type referred
downloads/inferencerules/release.tar.gz. FC17 is a more recently
to in the PRUNED-PATHS approach, All-Paths uses dynamic pro-
released version to FC, in which the number of paths between
gramming to exactly build the sums of all path representations
an entity pair ranges drastically from 1 to 900 or more, so the
over node sequences.
robust of methods in comparison can be better evaluated with
However, in their method they have to store scores for inter-
this dataset. Compared with the older version, FC17 has far more
mediate path length for all entity pairs, making it prohibitive to
ClueWeb triples. Statistics of both FC and FC17 is listed in Ta-
be used in large-scale KGs. Single-Model [167] is presented to im-
ble 17.
prove the performance of Path-RNN [166]. Rather than the ‘‘max’’
pooling, Single-Model leverages various score pooling strategy: Performance Comparison: We report the existing published ex-
Top-K, Average and LogSumExp, and among which the LogSumExp perimental performance of several path-based KGC models ac-
pooling performs best. LogSumExp pooling is deemed to play the cording to different evaluation datasets in Table 18, Tables 19
same role as attention mechanism and can integrate every path and 20. By the way, we also give some analysis about presented
in trainable proportion. results.
Unfortunately, none of these methods can simulate scenarios (1) On FC and FC17 datasets: Table 18 shows experimental
in which relations can be inferred only by considering multi- results of path-based KGC methods on FC and FC17 datasets. On
ple information paths [168]. On the other hand, each of these FC, it can be observed that: overall, PRCTA outperforms all the
36
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

Table 18 whether other approaches use additional information or not.


Experimental results of path-based KGC methods on FC and FC17 datasets. MAP Specifically, on FB15k, the Hit@10 of IRN surpasses all previous
on FC and FC17 are reported by [170] and [168], respectively. Best results are
in bold.
results by 5.7%. From Table 20 we could observe that: (a) Both
Model FC FC17
PTransE and RPE perform better than their basic model TransE
and TransR, which indicates that additional information from
MAP (%) MAP (%)
relation paths between entity pairs is helpful for link prediction.
PRA [164] 64.43 55.48
Also, OPTransE outperforms baselines which do not take relation
Path-RNN [166] 68.43 52.37
Single-Model [167] 70.11 58.28
paths into consideration in most cases. These results demonstrate
Single-Model + Type [167] 73.26 63.88 the effectiveness of taking advantage of the path features inside
Att-Model [168] 71.21 59.89 of KGs in OPTransE. (b) Except for the Hits@10 scores of IRN,
Att-Model + Type [168] 73.42 65.03 OPTransE almost performs best on all metrics compared to previ-
PRCTA [170] 76.44 –
ous path-based models like RTransE, PTransE, PaSKoGE, and RPE,
which implies that the order of relations in paths is of great im-
portance for knowledge reasoning, and learning representations
other methods in the table, which indicates the effectiveness of of ordered relation paths can significantly improve the accuracy
leveraging textual types and entity type discrimination for mh- of link prediction [174]. Moreover, the proposed pooling strategy
KGC as PRCTA does on the task of multi-hop reasoning. Besides, which aims to extract nonlinear features from different relation
as mentioned in [170], textual types and all attention mechanisms paths also contributes to the improvements of performance.
contribute to the ablation test, except that, conducting entity type
discrimination with constrained type attention also provides a 4.1.5.9. Discussion on relational path information in KGC. 1. Lim-
greater performance boosting. Notably, in terms of noise reduc- itation of path information: In multi-hop KGC (mh-KGC), the
tion, the result shows that the attention mechanisms adopted in path extracted from KGs mainly stores the structural information
PRCTA can significantly reduce noise. Specifically, the word-level of facts, which is inherently incomplete. This incompleteness
attention alleviates the representation sparseness by reducing can affect the process in different ways, e.g. it leads to rep-
noise in the whole type context, while constrained type attention resentations for nodes with few connections that are not very
further reduces noisy entity types and thus alleviates inefficiency informative, it can miss relevant patterns/paths (or derive mis-
on entities with a large number of types [170]. On FC17, there leading patterns/paths) [173]. The limited information will lead to
lacks of relevant data of PRCTA. the representation sparseness of entities and relations, resulting
Over FC17 dataset, the model ‘‘Att-Model + Type’’ [168] achieves in low discrimination for intermediate nodes, which constitutes a
the best performance. Not only the ‘AttModel’, using relations potential obstacle to the improvement of mh-KGC performance.
in the path, outperforms other methods that also use relation Therefore, it is necessary to consider other information to assist
only, but also the proposed method ’Att-Model+Types’, further reasoning, such as the semantic information of nodes in the path
considering the entities in the path by adding their types into (e.g., textual type attributes, entity or relation order information,
RNN modeling, still achieves considerable improvements than its et al.). Intuitively, incorporating knowledge from textual sources
main opponent ’Single-Model+Types’. All the comparison results by initializing the entity embeddings with a distributional rep-
above-mentioned indicate the importance of proper attention resentation of entities [182] could improve path-based relation
mechanisms. reasoning results further.
(2) On NELL995, FB15K-237, Kinship and Countries datasets: 2. Neglection on entity/relation types in mh-KGC: Although
Further, we report the data comparison on NELL995 and FB15k- previous works have introduced entities and relations types into
237 in Table 19 and observe that PRANN [175] can more accu- relational path reasoning tasks, they only consider single type en-
rately predict missing links on the large datasets compared with tities while actually, entities have more than one type, especially
other methods. Note that when it compared with the existing in different contexts, the same entities often have different types
non-path models to verify the competitiveness of the approach and semantics. Additionally, they do not distinguish entity types
in the KGC task, PRANN have achieved comparable results to the in different triples which may pose noisy entity types to limit the
state-of-the-art methods across all evaluation metrics, in especial final performance.
the MRR and Hits@k scores of MINERVA, a path-based KGC model
which is similar to that of PRANN. It is notable that on the KG such 3. More efficient attention mechanism: More flexible and ef-
as FB15k-237 with a large number of diverse relations, PRANN fective attention mechanism over mh-KGC tasks need to be ex-
performs better compared to other models in the experiment. plored. For example, previous methods often applied a similar
On the contrary, MINERVA [63] was giving slightly better results approach using the dot product to measure the match between
on the dataset with a fewer number of relations, such as the weighted path vectors and a target relation, although calculat-
Countries dataset. From Table 19 we can observe the experimen- ing the dot product attention is faster and space-efficient, in
tal results on the small datasets, Kinship, and Countries. PRANN some cases, more intelligent handling technologies such as addi-
also achieves excellent results on the Kinship dataset because this tional scaling factors are needed to compute the correct attention
weights. For example, in [175], an additive attention function
dataset was created to evaluate the reasoning ability of logic rule
using a feed-forward network that scales well to smaller values
learning systems with more predictable paths compared to other
are applied to act attention mechanism, it exhibits better perfor-
datasets. However, on the Countries dataset, PRANN shows lower
mance compared to the dot product and efficiently scales to large
results compared to MINERVA, relevant explanation is because
values [183]. What is more, it is fully differentiable and trained
the number of training triples in the Countries dataset is too small
with standard back-propagation.
to efficiently train our model.
(3) On WN18 and FB15K datasets: Table 20 presents the ex- 4. More efficient path encoder and more proper training ob-
perimental results on WN18 and FB15K, numbers in bold mean jective: Path reasoning in KG is still in continuous development,
the best results among all methods. The evaluation results of especially with the emergence of various coding structures, such
baselines are from their original work, and ‘‘–’’ in the table means as Bert, XLNet, etc. We can try to use more effective encoders to
there is no reported result in prior work. According to the ta- encode path features. In addition, when combined with the tradi-
ble, IRN significantly outperforms other baselines, regardless of tional methods, we can learn from previous experience (e.g., [166]
37
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

Table 19
Experimental results of path-based KGC methods on NELL995, FB15k-237, Kinship and Countries datasets. The public performance data in this table comes from
[175]. Best results are in bold.
Model NELL995 FB15k-237 Kinship Countries
MRR Hits@1 Hits@3 MRR Hits@1 Hits@3 MRR Hits@1 Hits@3 MRR Hits@1 Hits@3
PRA [164] 0.696 0.637 0.747 0.412 0.322 0.331 0.799 0.699 0.896 0.739 0.577 0.9
Single-Model [167] 0.859 0.788 0.914 0.575 0.512 0.567 0.804 0.814 0.885 0.941 0.918 0.956
MINERVA [63] 0.879 0.813 0.931 0.615 0.49 0.659 0.824 0.71 0.937 0.96 0.925 0.995
PRANN [175] 0.898 0.838 0.951 0.66 0.544 0.708 0.952 0.918 0.984 0.947 0.916 0.986

Table 20
Link prediction results of path-based KGC methods on WN18 and FB15k datasets.
All the data in the table comes from [174]. Best results are in bold.
Model WN18 FB15k
Hits@10 MR Hits@10 MR
RTransE [39] – – 0.762 50
PTransE (ADD, 2-step) [174] 0.927 221 0.834 54
PTransE (MUL, 2-step) [174] 0.909 230 0.777 67
PTransE (ADD, 3-step) [12] 0.942 219 0.846 58 Fig. 26. An example of the robustness of rule reasoning shown in [191].
PTransD (ADD,2-step) [117] – – 0.925 21
RPE (ACOM) [172] – – 0.855 41
RPE (MCOM) [172] – – 0.817 43
IRN [169] 0.953 249 0.927 38 makes rule-based reasoning achieve high accuracy. Moreover,
OPTransE [174] 0.957 199 0.899 33 logical rules are interpretable enough to provide insight into the
results of reasoning, and in many cases, this excellent character
will lead to the robustness of the KGC transfer task. For example,
conducting rule reasoning over an increasing KG can avoid parts
of retraining work due to the addition of new nodes, which is
more adaptable than models modeled for certain entities within
a specific KG. Consider the scenario in Fig. 26, when we add
some new facts about more companies or locations to this KG,
the rules with respect to ‘HasOfficeInCountry’ will still be usefully
accurate without retraining. The same might not be workable for
methods that learn embeddings for specific KG entities, as is done
in TransE. In other words, logical rule-based learning can be applied
Fig. 25. Example of rules for KGC. The picture refers to [184]. to those ‘‘zero-shot’’ entities that cannot be seen during training.
The rules are manually or automatically constructed as various
logic formulas, each formula learns a weight by sampling or
counting grounding from existing KGs. These weighted formulas
has shown that the non-linear composition function outperforms
are viewed as the long-range interactions across several relations
linear functions (as used by them) for relation prediction tasks)
[185]. Manual rules are not suitable for large-scale KGs, on the
to select and expand the appropriate linear or non-linear model.
other hand, it is hard to cover all rules in the specific domain
KG by hand. Recently, rule mining has become a hot research
4.2. External extra information outside KGs
topic, since it can automatically induce logical rules from ground
In this section we comb KGC studies which exploit external facts, i.e., captures co-occurrences of frequent patterns in KGs to
information and mainly include two aspects: rule-based KGC determine logical rules [207,209] in a machine-readable format.
in Section 4.2.1 and third-party data source-auxiliary KGC in 4.2.1.2. Definition about logical rules based KGC. Formulaically, the
Section 4.2.2. KGC over rules we consider here consists of a query, an entity tail
that the query is about, and an entity head that is the answer to
4.2.1. Rule-based KGC the query [191]. The goal is to retrieve a ranked list of entities
Logical rules in KGs are non-negligible in that they can provide based on the query such that the desired answer (i.e., head) is
us expert and declarative information for KGC, they have been ranked as high as possible.
demonstrated to play a pivotal role in inference [185–187], and
hence are of critical importance to KGC. In this section we give a Formulation of Logical Rules: In terms of first-order logic [210,
systemic introduction of KGC tasks working with various rules, 211], given a logical rule, it is first instantiated with concrete
we also list a summary table for rule-based KGC methods as entities in the vocabulary E, resulting in a set of ground rules.
shown in Table 21. Suppose X is a countable set of variables, C is a countable set of
constants. A rule is of the form head ← body as follows formula,
4.2.1.1. Introduction of logical rules. An example of KGC with log- where head query(Y , X ) is an atom over R ∪ X ∪ C and body
ical rules is shown in Fig. 25. From a novel perspective [192], KGs Rn (Y , Zn ) ∧ ... ∧ R1 (Z1 , X ) is a conjunction of positive or negative
can be regards as a collection of conceptual knowledge, which can atoms over R ∪ X ∪ C .
be represented as a set of rules like BornIn(x, y) ∧ Country(y, z) →
Nationality(x, z), meaning that if person x was born in city y and y query(Y , X ) ← Rn (Y , Zn ) ∧ ... ∧ R1 (Z1 , X )
is just right in country z, then x is a citizen of z. Rules are explicit
where R1 , . . . , Rn are relations in the KGs.
knowledge (compared to a neural network), thus reasonable use
of logic rules is of great significance to handle problems in KGs. Ground Atom & Rule’s Grounding: A triple (ei , rk , ej ) can be
Rule-based KGC allows knowledge transfer for a specific domain taken as a ground atom which applies a relation rk to a pair of en-
by exploiting rules about the relevant domain of expertise, which tities ei and ej . When replacing all variables in a rule with concrete
38
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

Table 21
Characteristics of introduced KGC methods using rules.
Model Technology Information Rules Dataset
Markov Logic Network (MLNs) series:
LRNNs [188] Standard feed-forward NN, First-order rules Function-free first-order logic 78 RL benchmarks
weighted first-order rules
MLN-based KGC Markov Logic Network, Rules – –
[189] mathematical axiom proof
ExpressGNN [190] GNNs+MLN, Logic rules, First order logical rules in MLN FB15K-237
solving zero-shot problem, entity information
EM algorithm,
mean-field approximation inference
End-to-end differentiable framework:
NuralLP [191] TensorLog, First-order logical Weighted chain-like logical rules WN18,
neural controller system (LSTM), rules FB15K,
attention mechanism FB15KSelected
RLvLR [192] Improves NuralLP, First-order logical CP rule: closed path rules FB75K,
RESCAL, rules WikiData
target oriented sampling
NTPs [193] RNN, First-order logical Function-free first-order logic rules, Countries,
backward chaining algorithm, rules parameterized rules, Kinship,
RBF kernel, unify rule, Nations,
ComplEx OR rule, UMLS
AND rule
NTP2.0 [194] NTPS, First-order logical Function-free first-order logic rules; Countries,
max pooling strategy, rules parameterized rules; Nations,
Hierarchical Navigable Small World (HNSW, a unify rule; Kinship,
ANNS structure) OR rule; UMLS
AND rule
DRUM [184] Open World Assumption, First-order logical – Family,
confidence score, rules UMLS,
BIRNN Kinship
Combining rule and embedding approach:
a. A shallow interaction:
r-KGE [185] ILP, Logical rules, Rule 1 (noisy observation); Location,
RESCAL/TRESCAL/TransE, physical rules Rule 2 (argument type expectation); Sport
four rules Rule 3 (at-most-one restraint);
Rule 4 (simple implication).
INS [195] MLNs, Paths, path rules FB15K
INS-ES, rules
TransE
ProRR-MF [196] ProPPR, First-order logical First-order logical rules FB15K,
matrix factorization, rules WordNet
BPR loss
b. Explore further combination style:
KALE [197] Translation hypothesis, Logic rules Horn logical rules WN18,
t-norm fuzzy logic FB122
Trans-rule [198] TransE/TransH/TransR, First-order logical Inference rules; WN18,
first-order logic space transformer, rules transitivity rules; FB166, FB15K
encode the rules in vector space, antisymmetry rules
confidence score with a threshold
c. Iteration interactions:
RUGE [199] Iterative model, Soft rules, Soft rules; FB15K,
soft label prediction, logic rules Horn logical rules YAGO37
embedding rectification,
confidence score
ItRI [200] KG embedding model, Feedback information Non-monotonic rules with negated FB15K,
iteratively learning, of KG embedding atoms; Wiki44K
pruning strategy, model text corpus, non-monotonic rules with
hybrid rule confidence measures non-monotonic rules partially-grounded atoms
IterE [201] Iterative model, OWL2 Language, 7 types of object property expression; WN18-s,
embedding representation, axioms information ontology axioms; WN18RR-s,
axiom induction, Horn logical rules FB15k-s,
axiom injection, FB15k-237-sa
confidence score,
linear mapping hypothesis

(continued on next page)

39
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

Table 21 (continued).
Model Technology Information Rules Dataset
pLogicNet [202] MLN, First order logical First order logical rules in MLN: FB15k,
EM algorithm, rules Composition Rules, FB15k-237,
amortized mean-field inference, Inverse Rules, WN18,
KG embedding model Symmetric Rules, WN18RR
(TransE/ComplEx) Subrelation Rules
Text + Logic rules:
FEP-AdTE [203] Knowledge verification system, Logical information, First-order logic rules FGCNb KGs
TEP-based abductive text evidence, text information
remote supervision
Rules + Paths + Embedding approaches:
AnyBURL [204] Aleph’s bottom-up rule learning Fuzzy rules, Straight ground path rule: FB15(k),
uncertain rules, AC1 rules, FB15-237,
path AC2 rules, WN18,
C rules WN18RR,
YAGO03-10
ELPKG [205] KGE model, Path information, Probabilistic soft logic rules YAGO,
breadth first search for paths, logic rules NELL,
probabilistic logical framework PSL YAGO-50,
YAGO-rest
RPJE [206] KGE model, Logical rules, Horn rules for two modules: FB15K,
confidence score, path R1 : relation pairs association, FB15K-237,
compositional representation learning R2 : paths composition WN18, NELL-995
Filtering candidate triples:
AMIE+ [207] Open-world assumption, First-order logical Single chain of variable rules YAGO2 core,
pruning operations rules for Confidence approximation; YAGO2s,
PCA; DBpedia 2.0,
typed rules DBpedia 3.8,
Wikidata
CHAI [26] Complex rules normalizer Rules Complex rules base on relation FB13,
domain and distance; WN18,
4 types filtering candidates criteria NELL,
EPSRC
About evaluation:
RuleN [208] An unify evaluation framework, Logical rules Path rules Pn ; WN18,
evaluated with AMIE model C rules FB15k,
FB15k-237
a
‘-s’ means the ‘-sparse’ series datasets.
b
The ‘FGCN’ means Four Great Chinese Novels in China.

entities in KG, we get a grounding of the rule. A logical rule is (1). Inductive logic programming (ILP) for rule mining:
encoded, for example, in the form of ∀x, y : (x, rs , y) → (x, rt , y),
Inductive logic programming (ILP) [213] (i.e. XAIL) is a type
reflecting that any two entities linked by relation rs should also be
of classical statistical relational learning (SRL) [214], it proposes
linked by relation rt [197]. For example, a universally quantified new logical rules and is commonly used to mine logical rules
rule ∀x, y : (x, CapitalOf , y) → (x, LocatedIn, y) might be instan- from KGs. Although ILP is a mature field, mining logical rules
tiated with the concrete entities of Paris and France, forming the from KGs is difficult because of the open-world assumption KGs
ground rule (Paris, CapitalOf , France) → (Paris, LocatedIn, France). abide by, which means that absent information cannot be taken
A grounding with all triples existing in the KG is a support of this as counterexamples.
rule, and the ground rule can then be interpreted as a complex (2). Markov Logic Networks (MLNs) and its extensions:
formula, constructed by combining ground atoms with logical Often the underlying logic is a probabilistic logic, such as
connectives (e.g. ∧ and →). Markov Logic Networks (MLNs) [215] or ProPPR [216]. The
Logical Rules for KGC: To reason over KGs, for each query it is advantage of using probabilistic logic is that by equipping logical
usually interested in learning weighted chain-like rules of a form rules with probability, one can better statistically model complex
similar to stochastic logic programs [212]: and noisy data [191].
MLNs combines hard logic rules and probabilistic graphical
α query(Y , X ) ← Rn (Y , Zn ) ∧ ... ∧ R1 (Z1 , X ) models. The logic rules incorporate prior knowledge and allow
MLNs to generalize in tasks with a small amount of labeled
where α ∈ [0, 1] means the confidence associated with this rule. data, while the graphical model formalism provides a principled
In a generic sense, the inference procedure will define the score framework for dealing with uncertainty in data. However, infer-
of each y implies query (y, x) as the sum of the confidence of the ence in MLN is computationally intensive, typically exponential
rules for the given entity x, and we will return a ranked list of in the number of entities, limiting the real-world application of
entities where higher the score implies higher the ranking [191]. MLN. Also, logic rules can only cover a small part of the possible
4.2.1.3. Rule mining. Inferring the missing facts among existing combinations of KG relations, hence limiting the application of
entities and relations in the growing KG by rule-based inference models that are purely based on logic rules.
approaches has become a hot research topic, and how to learn Lifted Relational Neural Networks (LRNNs) [188] is a lifted
the rules used for KGC also catches people’s eye. There is a lot of model that exploits weighted first-order rules and a set of rela-
literature that takes many interests in rule learning technology. tional facts work together for defining a standard feed-forward
40
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

neural network, where the weight of rules can be learned by networks following a backward chaining algorithm referred in
stochastic gradient descent and it constructs a separate ground Prolog, performing inference by recursively modeling transitiv-
neural network for each example. ity relations between facts represented with vectors or tensors
using RNN. NTPs makes full use of the similarity of similar sub-
A Theoretical study of MLN-based KGC (MLN-based KGC) [189]
symbolic representations in vector space to prove queries and
explores the possibility that using MLN for KGC under the maxi-
induce function-free first-order logical rules, the learned rules are
mum likelihood estimation, it discusses the applicability of learn-
ing the weights of MLN from KGs in the case of missing data used to perform KGC even further. Although NTPs demonstrates
theoretically. In this work, it is proved by mathematical axiom better results than ComplEx in a majority of evaluation datasets,
proof that the original method, which takes the weight of MLNs it has less scalability compared to NeuralLP as the limitation of
learning on a given and incomplete KG as meaningful and correct computation complexity which considers all the proof paths for
(i.e. using the so-called closed world assumption), and predicts each given query.
the learned MLN on the same open KGs to infer the missing facts NTP 2.0 [194] whereby scales up NTPS to deal with real-world
is feasible. Based on the assumption that the missing triples are datasets cannot be handled before. After constructing the compu-
independent and have the same probability, this paper points out tation graph as same as NTPs, NTPs 2.0 employs a pooling strategy
that the necessary condition for the original reasoning method is to only concentrate on the most promising proof paths, reduc-
that the learning distribution represented by MLN should be as ing the solutions searching procedure into an Approximate Near-
close as possible to the data generating distribution. In particular, est Neighbor Search (ANNS) problem using Hierarchical Navigable
maximizing the log-likelihood of training data should lead to Small World (HNSW) [222,223].
maximizing the expected log-likelihood of the MLN model.
DRUM [184], an extensible and differentiable first-order logic rule
ExpressGNN [190] explores the combination of MLNs and popular mining algorithm, further improves NeuralLP by learning the rule
GNNs in KGC field, and applies GNNs into MLN variational reason- structure and the confidence score corresponding to the rule, and
ing. It uses GNNs to explicitly capture the structural knowledge establishes a connection between each rule and the confidence
encoded in the KG to supplement the knowledge in the logic score learned by tensor approximation, uses BIRNN to share use-
formula for predicting tasks. The compact GNNs allocates simi- ful information when learning rules. Although it makes up for
lar embedding to similar entities in the KG, while the express- the shortcomings of the previous inductive LP methods that have
ible adjustable embedding provides additional model capacity to poor interpretability and cannot infer unknown entities, DRUM
encode specific entity information outside the graph structure. is still developed on the basis of the Open World Assumption
ExpressGNN overcomes the scalability challenge of MLNs through of KGs and is limited to positive examples in training. In the
efficient stochastic training algorithm, compact posterior param- following research, it is necessary to further explore improved
eterization and GNNs. A large number of experiments show that
DRUM methods suitable for negative sampling, or try to explore
ExpressGNN can effectively carry out probabilistic logic reason-
the same combination of representation learning and differential
ing, and make full use of the prior knowledge encoded within
rule mining as methods [63,224].
logic rules while meet data-driven requirement. It achieves a
good balance between the representation ability and the simplic- 4.2.1.4. Combining rule-based KGC models with KGE models. The
ity of the model. In addition, it not only can solve the zero-shot rule-based KGC models provide interpretable reasoning and allows
problem, but also is a general enough which can balance the domain-specific knowledge transfer by using the rules about re-
compactness and expressiveness of the model by adjusting the lated professional fields. Compared to the representation model,
dimensions of GNNs and embedding. the rule-based models do not need a lot of high-quality data but
(3). End-to-end differentiable rule-based KGC methods: can achieve high accuracy and strong interpretability. However,
Based on these proposed basic rule-mining theories, a large they often face efficiency problems in large-scale search space;
amount of end-to-end differentiable rule-based KGC methods are while the embedding-based KGC models, i.e., the KGE models,
developed according to these types of rules. have higher scalability and efficiency but they have a flaw in
Neural Logic Programming (NeuralLP) [191] is an end-to-end dealing with sparse data due to their great dependence on data.
differentiable framework which combines first-order rules in- We summarize the advantages and disadvantages of rule-based
ference and sparse matrix multiplication, thus it allow us learn KGC and embedding-based KGC methods in a simplified table
parameters and structure of logical rules simultaneously. Addi- (Table 22). Therefore, there is no doubt that combining rule-based
tionally, this work establishes a neural controller system using reasoning with KGE models to conduct KGC will be noteworthy.
attention mechanism to properly allot confidences to the logical Please see Fig. 27 for a rough understanding of the researches of
rules in the semantic level, rather than merely ‘‘softly’’ generate combining rule information with KGE models.
approximate rules as mentioned in previous works [217–220], (1). A shallow interaction:
and the main function of the neural controller system is con- There are already some simple integrating works in the earlier
trolling the composition procedure of primitive differentiable attempts:
operations of TensorLog [221] in the memory of LSTM to learn r-KGE [185] is one of these methods, it tries to utilize ILP to in-
variable rule lengths. tegrate the embedding model (three embedding models: RESCAL,
RLvLR [192] aims at tackling the main challenges in the scalability TRESCAL, TransE) and four rules (including logical rules and phys-
of rule mining. Learning rules from KGs with the RESCAL embed- ical rules): rules are expressed as constraints of the maximization
ding technique, RLvLR guides rules mining by exploring in pred- problem, by which the size of embedding space is greatly re-
icates and arguments embedding space. A new target-oriented duced. r-KGE employs relaxation variables to model the noise
sampling method makes huge contributions to the scalability of explicitly, and a simple noise reduction method is used to reduce
RLvLR in inferring over large KGs, and the assessment work for the noise of KGs. But there are some disadvantages in this work:
candidate rules is handled by a suit of matrix operations referred it cannot solve n − to − n relations and the reasoning process is
to [207,209]. RLvLR shows a good performance both in the rules’s too time-consuming, especially for large KGs, which makes the
quality and the system scalability compared with NeuralLP. algorithm has poor scalability.
NTPs [193] is similar to NeuralLP, it focuses on the fusion of INS [195] is a data-driven inference method which naturally
neural networks and rule inferring as well, but models neural incorporates the logic rules and TransE together through MLNs
41
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

Table 22
Statistics of pros and cons of rule-based KGC methods and embedding-based KGC methods.
Category Advantage Disadvantage
Rule-based KGC 1. Consider explicit logical semantics 1. Poor Scalability
2. Strong explainability and accuracy 2. Noise sensitive
3. Data dependency 3. High computational complexity
4. Can be applied to both transductive and inductive problems
5. High robustness avoiding re-training
Embedding-based KGC 1. High scalability 1. Data-driven
2. High efficiency 2. Poor explainability
3. Not affected by huge candidate sets 3. Hard to model the interaction of different relations
4. Cannot handle inductive scenarios

Fig. 27. Several typical KGC models which combine logical rules and embedding models, the development from (a) to (d) shows the process of deepening interaction
between rules and embedding models.
Source: These pictures are extracted from [52,185,197,199]

to conduct KGC, where TransE calculates the similarity score work is the first formal research on low dimensional embedding
between the candidate and the correct tag, so as to take the learning of first-order logic rules. However, it is still in a dilemma
top-N instances selection to form a smaller new candidate set, in predicting new knowledge since it has not combines entity,
which not only filters out the useless noise candidates, but also relation and rule embedding to cooperate symbolic reasoning
improves the efficiency of the reasoning algorithm. The calculated with statistical reasoning.
similarity score is used as a priori knowledge to promote further Nevertheless, although these several KGC methods jointly
reasoning. For these selected candidate cases, INS and its im- model with logical rules and embeddings, the rules involved in
proved version INS-ES [195] algorithm adopted in MLN network them are used merely as the post-processing of the embedding
is proposed to consider the probability of transition between methods, which leads to less advance in the generation of better
network sampling states during reasoning, therefore, the whole embedding representation [197].
reasoning process turns into supervised. It is worth noting that (2). Explore further combination style:
INS greatly improves the Hits@1 score in FB15K dataset. Different from previous approaches, the latter literatures ex-
pect to explore more meaningful combination ways, rather than
A Matrix Factorization Based Algorithm utilizing ProPPR
just jointly working on the surface level.
(ProRR-MF) [196] tries to construct continuous low dimensional
embedding representation for first-order logics from scratch, and KALE [197] is a very simple KGC model which combines the
is interested in learning the potential and distributed represen- embedding model with the logical rules, but pays attention to
tation of horn clause. It uses scalable probabilistic logic structure the deep interaction between rules and embedding methods. The
(ProPPR in [216]) learning to construct expressive and learnable main idea of KALE is to represent triples and rules in a unified
logic formulas from the large noisy real-world KGs, and applies framework, in which triples are represented by atomic formulas
a matrix factorization method to learn formula embedding. This and modeled by translation hypothesis; rules are represented by
42
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

complex formulas and modeled by t-norm fuzzy logic. Embedding the sparsity of KGs but also pays attention to the influence of
can minimize the overall loss of atomic formulas and complex semantics on rules. IterE proposes a new form of combining rule
formulas. In particular, it enhances the prediction ability of new and embedding representation, which provides a new idea for
facts that cannot be inferred directly from pure logic inference, KGC research combining different types of methods.
and it has strong generality for rules.
pLogicNet proposed by [202] is the product of cooperation be-
Trans-rule [198] is also a translation-based KG embedding and tween KG embedded model and MLN logic rules. Similar to IterE,
logic rules associative mode, what distinguishes this model from the operation process of pLogicNet is also carried out under the
previous similar works is that, it concerns about rules having deep interaction between embedding and rules. The difference is
confidence above a threshold, including inference rules, transitivity that in pLogicNet, a first-order Markov logic network is used to
rules and antisymmetry rules, these rules and their confidences define the joint distribution of all possible triples, then applies the
are automatically mined from triples in KG, then they are placed variant algorithm of EM algorithm to optimize pLogicNet. In the E
together with triples into a unify first-order logic space which al- step of the variant EM algorithm, the probability of unobserved
low rules encoded in it. Additionally, to avoid algebraic operations triples is deduced by using amortized mean-field inference, and
inconsistency problem, it maps all triples into first-order logics, the variation distribution is parameterized as the parameter of
and also defines kinds of interaction operations for rules to keep the KG embedding model; in M-step, the weights of the logic
the form of rules encoding to 1-to-1 mapping relation. rules are updated by defining the pseudo-likelihood on both the
(3) Iteration interactions observed triples and the triples inferred from the embedding
With the emergence of new completion demands, a new way model. PLogicNet can effectively use the stochastic gradient de-
of jointly learn rules and embeddings for KGC in a iteration scent algorithm to train. The training process iteratively performs
manner comes into being. E-step and M-step until convergence, and the convergence speed
of the algorithm is very satisfactory.
RUGE [199] is a novel paradigm of KG embedding model which
combines the embedding model with logic rules and exploits
4.2.1.5. Cooperating rules with other information. (1) Cooperating
guidance from soft rules in an iterative way. RUGE enables the
with abductive text evidence
embedding model to learn both labeled and unlabeled triples in
existing KG, and the soft rules with different confidence levels can be
acquired automatically from the KG at the same time. Different from
the previous studies, this work first applies an iterative manner
to deeply capture the interactive nature between embedding The main idea of this paper is to define the explanation of triples
learning and logical inference. The iterative procedure can auto- — the form of (triples, windows) abductive text evidence based
matically extracted beneficial soft rules without extensive manual on TEP, in which the sentence window w explains the degree of
effort that are needed in the conventional attempts which always the existence of the triple τ , and uses the remote supervision
use hard rules in a one-time injection manner. Each iteration method in relation extraction to estimate the abductive text
contains two stage: soft label prediction and embedding rectifi- evidence. FEP-AdTE considers only the subset-minimal abductive
cation, the two partial responsible for approximately reasoning, explanation (called Mina explanation) to make the explanation
predicting and updating the KG with the newly predicted triples as concise as possible and applies the hypothesis constraint to
for further better embeddings in the next iteration respectively. limit the number of Mina explanations to be calculated to make
Though the whole iteration procedure, this flexible approach can the interpretation work possible. It is worth mentioning that this
fully divert the rich knowledge contained in logic rules to the paper has developed KGs corresponding to the text corpus of
learned embeddings. Moreover, RUGE demonstrates the useful- four Chinese classics to evaluate the new knowledge verification
ness of automatically extracted soft rules according a series of mechanism of this paper. However, the triple interpretation in
experiments. this paper does not contain valuable entity-type attributes. In
future work, we can consider adding pragmatic interpretation
Iterative Rules Inducing (ItRI) [200] iteratively extends induced
of entity types to further enhance the verification effect of new
rules guided by feedback information of the KG embedding model
knowledge and make contributions to KGC.
calculated in advance (including probabilistic representations of
(2) Cooperating with path evidence
missing facts) as well as external information sources, such as text
corpus, thus the devised approach not only learns high quality AnyBURL [204] can learn logic rules from large KGs in a bottom-
rules, but also avoids scalability problems. Moreover, this machin- up manner at any time. AnyBURL is further designed as an ef-
ery is more expressive through supporting non-monotonic rules fective KG rule miner, the concept of example is based on the
of negated atoms and partially grounded atoms. interpretation of path in KGs, which indicates that KGs can be
formed into a group of paths with edge marks. In addition, Any-
IterE [201] recursively combines the embedding model and rules
BURL learns fuzzy, uncertain rules. Because the candidate ranking
to learn the embedding representation as well as logic rules.
can be explained by the rules that generate the ranking, AnyBURL
IterE mainly consists of three parts: embedding representation,
has good explanatory power. In addition to the other advantages
axiom induction, axiom injection, and the training is carried out
of rule-based KGC, the additional advantages of AnyBURL are its
by interactive iteration among these three parts so that rules
fast running speed and less use of resources. In addition, AnyBURL
and embedding can promote each other to the greatest extent,
proves that rule learning can be effectively applied to larger KBs,
forming the final reasoning framework. Specifically, on the one
which overturns the previous bias against the rule-based KGC
hand, the embedding model learns from the existing triples in
method.
KGs as well as the triples inferred from the rules. On the other
hand, the confidence score of axioms derived from the pruning ELPKG [205] combines path information, embedding representa-
strategy should be calculated on the learned relational embed- tion and soft probability logic rules together. In a word, the KG
dings according to the linear mapping hypothesis, and then new embedding model is used to train the representation of inter-
triples can be inferred by the axioms. Finally, the new triples entity relation, and breadth-first search is used to find the path
are linked into KGs for following entity embedding learning. between entity nodes. The representation of entity/relation based
The recursive operation designed by IterE not only alleviates on path information is combined with the representation based
on the embedding vector to generate relational representations between entities. On this basis, probabilistic soft logic is applied to deduce and predict the relation probability between entities to perform KGC, which solves the problems of knowledge inconsistency and knowledge conflict. Finally, the method is used to complete the relations between KG entities. ELPKG not only ensures efficiency but also shows high accuracy in LP. Because it makes full use of the existing facts of the KG, it does not need external auxiliary knowledge.

RPJE [206] also combines paths and semantic-level relation associations by Horn rules. Firstly, it mines and encodes logical rules in Horn-clause form with different lengths from the knowledge graph; it then uses rules of length 2 to accurately compose paths, and explicitly uses rules of length 1 to create semantic associations between relations and constrain relation vector representations. In addition, the confidence of each rule is considered in the optimization process to ensure that the rule is effective in representation learning. RPJE combines logic rules and paths to embed the KG, which fully benefits from the interpretability and accuracy of logic rule-based KGC methods, the generalization of KG embedding, and the semantic structure information provided by paths. The combination strategy of this paper is simple, so it is worth trying more complex combination methods, such as an LSTM with an attention mechanism suitable for long-path modeling. In addition, learning from the interaction between embeddings and rules in IterE and pLogicNet, exploring how to use a well-designed closed-loop system to push embedding information back from RPJE to rule learning also deserves attention.

4.2.1.6. Candidate filtering in rule reasoning. Some rules are proposed for filtering candidate triples (called filtering rules) in the context of the KGC process by combining a number of criteria in such a way that a given fitness function is optimized. The produced rules can be applied to the initial set of candidates to generate a reduced set that contains only the more promising candidate triples, rather than using the full set of possible missing candidate triples (which provides no filtering) or applying very basic rules to filter out unlikely candidates, as most current approaches do, which may have a negative effect on completion performance since very few candidate triples are filtered out [26]. A summary of candidate filtering works is listed in Table 23.

Table 23
Candidate filtering works in rule reasoning.
Model | Candidate filtering strategies
AMIE, AMIE+ [207] | Maximum rule length, perfect rules, simplifying projection queries, confidence threshold minConf
INS [195] | Instance selection using TransE
NTP 2.0 [194] | Approximate Nearest Neighbor Search
RLvLR [192] | MinSC and MinHC
IterE [201] | Traversing and random selection
DRUM [184], RUGE [199], Trans-rule [198], RPJE [206], ItRI [200] | Confidence measures for pruning rules
CHAI [26] | Filtering candidate criteria in KGs: exists_KG((h, r, t)) ↔ ∃e ∈ E | (h, r, e) ∈ T; dom_{KG,rel}((h, r, t)) ↔ ∃e ∈ E | (t, rel, e) ∈ T; ran_{KG,rel}((h, r, t)) ↔ ∃e ∈ E | (e, rel, t) ∈ T; distance_{KG,i}((h, r, t)) ↔ dist(KG, h, t) ⩽ i

AMIE+ [207] presents a series of pruning strategies, including formulating a maximum rule length, perfect rules, and simplified projection queries. Besides, it prunes rules with a confidence threshold minConf and conducts confidence approximations that allow the system to explore the search space much more efficiently.

Inferring via Grounding Network Sampling (INS) [195] employs an embedding-based model (TransE) to conduct instance selection and form much smaller candidate sets for subsequent fact inference, whose aim is not only narrowing the candidate sets but also filtering out part of the noise instances.

NTP 2.0 [194] shows that searching for the answer facts over KGs that best explain a query can be reduced to a k-nearest neighbor problem, for which efficient exact and approximate solutions exist [79].

RLvLR [192] sets MinSC and MinHC, which represent the minimum values of standard confidence and head coverage for learned rules, respectively, to further filter the candidate rules.

IterE [201] utilizes a pruning strategy combining traversing and random selection to generate a pool of possible axioms, and then assigns a score to each axiom in the pool based on a calculation over relation embeddings according to rule conclusions from the linear map assumption.

The works in [184,198-200,206] tend to devise confidence measures that capture rule quality better for pruning out unpromising rules, thus improving the ranking of rules.

CHAI [26] focuses on the filtering of candidate triples in the KGC process. It points out that previous KGC methods consider all candidate triples or filter candidate sets roughly, which is not reasonable. To solve these problems, CHAI considers more complex rules based on relation domain and distance to normalize the candidate set and effectively selects the most promising candidate triples to form the smallest candidate set, so as to improve the performance of KGC. Although this method provides a good idea for filtering candidate triples, it is not suitable for large relational KGs and sparse KGs, which can be further improved in the future. In the experiments, it is compared with [25], whose candidate set filtering proposal is to replace the target entity with the entities within the range of all relations of the existing triples, so as to generate candidate triples.

4.2.1.7. Evaluation and datasets of rule-based KGC methods. About Evaluation: Rule mining has traditionally relied on predefined statistical measures such as support and confidence to assess the quality of rules [192]. These are fixed heuristic measures. For example, to assess the quality of mined rules, the common measures used in rule learning mostly evaluate candidate rules according to their Standard Confidence (SC) and Head Coverage (HC). An entity pair (e, e') satisfies the body of rule r (denoted body(r)(e, e')) if there exist entities e1, ..., en-1 and facts R1(e, e1), R2(e1, e2), ..., Rn(en-1, e') in the KG, and it satisfies the head of r (denoted Rt(e, e')) if Rt(e, e') exists in the KG. SC and HC are then computed as follows:

SC(r) = supp(r) / #{(e, e') : body(r)(e, e')}

HC(r) = supp(r) / #{(e, e') : Rt(e, e')}

where supp(r) is the support degree of rule r:

supp(r) = #{(e, e') : body(r)(e, e') ∧ Rt(e, e')}

However, these measures may not be optimal for the various use cases in which one might want to use the rules. For instance, using SC is not necessarily optimal for statistical relational learning. Therefore, the work in [207] develops the PCA confidence to allow counterexample generation in a less restrictive way than SC.
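To make the support, SC and HC definitions above concrete, the following is a minimal Python sketch. The toy triples, the relation names and the restriction to length-1 rule bodies are illustrative assumptions, not part of any cited system; longer rule bodies would require enumerating paths instead of single edges.

```python
# Toy KG: a set of (head, relation, tail) triples, purely illustrative.
triples = {
    ("anna", "bornIn", "paris"), ("anna", "livesIn", "paris"),
    ("bob", "bornIn", "rome"),   ("bob", "livesIn", "rome"),
    ("carl", "bornIn", "rome"),  ("carl", "livesIn", "milan"),
}

def pairs_of(relation, kg):
    """All (head, tail) entity pairs connected by `relation` in the KG."""
    return {(h, t) for h, r, t in kg if r == relation}

def rule_metrics(body_rel, head_rel, kg):
    """Support, standard confidence (SC) and head coverage (HC)
    of the length-1 Horn rule  head_rel(x, y) <- body_rel(x, y)."""
    body_pairs = pairs_of(body_rel, kg)   # pairs satisfying the rule body
    head_pairs = pairs_of(head_rel, kg)   # pairs satisfying the rule head
    support = len(body_pairs & head_pairs)
    sc = support / len(body_pairs) if body_pairs else 0.0
    hc = support / len(head_pairs) if head_pairs else 0.0
    return support, sc, hc

if __name__ == "__main__":
    supp, sc, hc = rule_metrics("bornIn", "livesIn", triples)
    # livesIn(x, y) <- bornIn(x, y): support 2, SC = 2/3, HC = 2/3 on this toy KG
    print(f"supp={supp}, SC={sc:.2f}, HC={hc:.2f}")
```

In a real miner such as AMIE+, the same counts are additionally combined with pruning thresholds (e.g., minConf) so that low-quality candidate rules are discarded before grounding.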
Table 24
A KGC example using rules, referred to [208]. In this instance, four relevant rules for the completion task (h, r, ?) result in the ranking (g(0.81), d(0.81), e(0.23), f(0.23), c(0.15)). A rule can generate one candidate (fourth row), several candidates (first and third row), or no candidate (second row).
Rule | Type | Confidence | Result
r(x, y) ⇐ s(y, x) | P1 | 0.81 | {d, g}
r(x, y) ⇐ r(y, x) | P1 | 0.7 | ∅
r(x, y) ⇐ t(x, z) ∧ u(z, y) | P2 | 0.23 | {e, f, g}
r(x, c) ⇐ ∃y r(x, y) | C | 0.15 | {c}

Table 25
Statistics about other datasets for KGC using rules.
Dataset | Entity | Relation | #Train | #Valid | #Test
NELL-995 [206] | 75,492 | 200 | 123,370 | 15,000 | 15,838
DRC [203] | 388 | 45 | 333 | – | 34,530
JW [203] | 104 | 21 | 106 | – | 27,670
OM [203] | 156 | 38 | 178 | – | 34,010
RTK [203] | 123 | 30 | 132 | – | 29,817
FB122 [197] | 9,738 | 122 | 91,638 | 9,595 | 5057+6186
FB166 [198] | 9,658 | 166 | 100,289 | 10,457 | 12,327
YAGO [205] | 192,628 | 51 | 192,900 | – | –
NELL [205] | 2,156,462 | 50 | 2,465,372 | – | –
YAGO-50 [205] | 192,628 | 50 | 100,774 | – | –
YAGO-rest [205] | 192,628 | 41 | 92,126 | – | –
Sport [185] | 447 | 5 | 710 | – | –
Location [185] | 195 | 5 | 231 | – | –
Countries [225] | 244+23 | 5 | 1,158 | – | –

Besides, the work in [184] uses two theorems to learn rule structures and appropriate scores simultaneously. However, this is challenging because the method needs to find an optimal structure in a large discrete space and simultaneously learn proper score values in a continuous space. Because the process of evaluating candidate rules in a rule mining system is generally challenging and time-consuming, [192] reduces this computation to a series of matrix operations. This efficient rule evaluating mechanism allows the rule mining system to handle massive benchmarks efficiently. Meilicke et al. [208] present a unified fine-grained evaluation framework that commonly assesses rule-based inferring models over the datasets generally used for embedding-based models, making the effort to observe valuable rules and interesting experiences for KGC. The rules' confidences should be considered as well, since when we use relevant rules for the completion task (h, r, ?), a rule can generate a variable number of candidates, and there are various possible ways of aggregating the results generated by the rules. The work in [208] defines the final score of an entity as the maximum confidence score over all rules that generated this entity. Furthermore, if a candidate has been generated by more than one rule, they use the number of these rules as a secondary sorting attribute among candidates with the same (maximum) score. For instance, in Table 24, there are four relevant rules for completing (h, r, ?), resulting in the final ranking (g(0.81), d(0.81), e(0.23), f(0.23), c(0.15)). To support the evaluation system, this paper designs a simplified rule-based model called RuleN for assessing experiments, which is evaluated together with the AMIE model. With the inspiring results of experiments showing that models integrating multiple different types of KGC approaches deserve attention in the KGC task, this paper further classifies the test cases of datasets for fine-grained evaluation according to the interpretation generated by the rule-based method, and then obtains a series of observations about the partitioning of test cases in datasets.

Datasets: Table 25 lists the basic statistics of commonly used datasets for rule-based KGC research. Here we introduce several datasets in detail.

NELL: NELL datasets (http://rtw.ml.cmu.edu/rtw/resources) and their subsets are often used as experimental data, including NELL-995 [206], and Location and Sport [185].

FB122: composed of 122 Freebase relations [197] regarding the topics of "people", "location", and "sports", extracted from FB15K. FB122's test set is further split into two parts, test-I and test-II, where the former contains triples that cannot be directly inferred by pure logical inference, and the latter contains the remaining test triples.

Countries: a dataset introduced by [225] for testing the reasoning capabilities of neural link prediction models [193]. Triples in Countries are (countries(c), regions(r), subregions(sr)), and they are divided into train, dev and test sets which contain 204, 20 and 20 countries, respectively.

KGs about the Four great classical masterpieces of Chinese literature (FGCN): new KGs and the corresponding logical theories are constructed from existing text corpora in a domain about character relationships in the four great classical masterpieces of Chinese literature, namely Dream of the Red Chamber (DRC), Journey to the West (JW), Outlaws of the Marsh (OM), and Romance of the Three Kingdoms (RTK) [203]. Triples in those KGs are collected on character relationships from e-books for these masterpieces, yielding four KGs, each of which corresponds to one masterpiece.

4.2.1.8. Analysis of rule-based KGC methods. In summary, we analyze some tips about experimenting with rule-based KGC methods on the common benchmarks, referring to the results generated in [208], which allow for a more comprehensive comparison between various rule-based methods and embedding-based approaches for KGC by employing a global measure to rank the different methods. On this basis, we gained several interesting insights:
1. Both AMIE and RuleN perform competitively with embedding-based approaches on the most common benchmarks. This holds for the large majority of models reported in [226]. Only a few of these embedding models perform slightly better.
2. Since the rule-based approaches can deliver an explanation for the resulting ranking, this characteristic can be helpful to conduct fine-grained evaluations and understand the regularities within the hardness of a dataset [204].
3. The traditional embedding-based KGC methods may struggle with specific types of completion tasks that can be solved easily with rule-based approaches; this tip becomes even more important when looking solely at the top candidate of the filtered ranking.
4. One reason for the good results of rule-based systems is the fact that most standard datasets are dominated by rules such as symmetry and (inverse) equivalence (except for those especially constructed datasets, e.g., FB15k-237).
5. It is quite possible to leverage both families of approaches by learning an ensemble [185,195-199,202] to achieve better results than any of its members. The overall ensemble models tend to contain a closed-loop operation, which indicates that the embedding representations and rules reinforce each other. In the future, it is necessary to explore more effective interaction ways for integrating these two categories of approaches.
6. Recently, novel effective but complex KG encoding models emerge endlessly, which also provides alternative techniques for KGC to combine knowledge embedding and rules in the future.

4.2.2. Third-party data sources-based KGC
Some related techniques learn entity/relation embeddings from triples in a KG jointly with third-party data sources, in particular with an additional textual corpus (e.g., Wikipedia articles), to get help from the related rich semantic information.
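Before turning to the individual models in Table 26, the shared recipe of this family, optimizing a KG objective and a text objective over (partially) shared embeddings, can be illustrated with a minimal PyTorch sketch. The class, the translation-style and skip-gram-style losses, and the name-based alignment term are simplified assumptions for illustration; they do not reproduce the exact formulation of any single model listed below.

```python
import torch
import torch.nn as nn

class JointTextKGEmbedding(nn.Module):
    """Sketch of jointly embedding a KG and a text corpus in one space:
    a translation-style loss on triples plus a word co-occurrence loss,
    with an alignment term that ties entities to the words naming them."""

    def __init__(self, n_entities, n_relations, n_words, dim=100):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        self.word = nn.Embedding(n_words, dim)

    def kg_loss(self, h, r, t, t_neg, margin=1.0):
        # TransE-style margin between a true triple and a corrupted one.
        pos = (self.ent(h) + self.rel(r) - self.ent(t)).norm(p=2, dim=-1)
        neg = (self.ent(h) + self.rel(r) - self.ent(t_neg)).norm(p=2, dim=-1)
        return torch.relu(margin + pos - neg).mean()

    def text_loss(self, center, context, context_neg, margin=1.0):
        # Skip-gram-like term: a word should lie closer to an observed
        # context word than to a randomly sampled one.
        pos = (self.word(center) - self.word(context)).norm(p=2, dim=-1)
        neg = (self.word(center) - self.word(context_neg)).norm(p=2, dim=-1)
        return torch.relu(margin + pos - neg).mean()

    def align_loss(self, entity_ids, name_word_ids):
        # Alignment term: pull each entity towards the word used as its name
        # (name/anchor-based alignment, in the spirit of the joint models).
        return (self.ent(entity_ids) - self.word(name_word_ids)).norm(p=2, dim=-1).mean()

# Overall objective L = L_K + L_T + L_A, optimized with any SGD variant.
model = JointTextKGEmbedding(n_entities=1000, n_relations=50, n_words=5000)
h, r, t, t_neg = (torch.tensor([i]) for i in (0, 1, 2, 3))
loss = (model.kg_loss(h, r, t, t_neg)
        + model.text_loss(torch.tensor([4]), torch.tensor([5]), torch.tensor([6]))
        + model.align_loss(torch.tensor([0]), torch.tensor([4])))
```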
Table 26
Statistics of popular KGC models using third-party data sources.
Models Technology Information (Data Source) Datasets
Joint alignment model:
JointAS [15] TransE, skip-gram, Structural information, Freebase subset;
words co-occurrence, entities names, English Wikipedia
entity-words co-occurrence Wikipedia anchors
JointTS [121] TransE, skip-gram, Structural information, FB15K,
JointAS entities names Freebase subset;
Wikipedia text descriptions, Wikipedia articles
textual corpus
DKRL [94] TransE, CBOW, CNN, Structural information, FB15K,
max/mean-pooling multi-hop path, FB20K
entity descriptions
SSP [227] TransE, topic extraction, Structural information, FB15K;
Semantic hyperplane Projection entity descriptions Wikipedia corpuses
Prob-TransE (or TransE/TransD, CNN, Structural information, FB15K;
TransD) semantics-based attention mechanism entity descriptions, NYT-FB15K
JointE (or JointD) anchor text, textual corpus
[228]
JOINER [229] TransE, Structural information, Freebase subset;
regularization, textual corpus, English Wikipedia
JointAS Wikipedia anchors
ATE [230] TransE, BiLSTM, Skip-Gram, Relation mentions and entity descriptions, Freebase, WordNet;
mutual attention mechanism textual corpus English Wikipedia (Wiki)
aJOINT [162] TransE, KG structural information, WN11, WN18, FB13, FB15k;
collaborative attention mechanism textual corpus Wikipedia articles
KGC with Pre-trained Language Models (PLMs):
JointAS [15], DESP word2vec Structural information, textual information FB15K, FB20K
[121], DKRL [94]
LRAE [231] TransE, PCA, word2vec Structural information, FB15k,
entity descriptions WordNet
RLKB [232] Probabilistic model, Structural information, FB500K, EN15K
single-layer NN entity descriptions
Jointly-Model TransE, CBOW/LSTM, Attention, Structural information, FB15K,
[233] Gate Strategy entity descriptions WN18
KGloVe-literals Entity recognition, Textual information in properties, Cities, the AAUP, the Forbes,
[234] KGloVe textual corpus the Metacritic Movies,
the Metacritic Albums;
DBpedia abstracts
Context Graph Context graph, CBOW, Skip-Gram Analogy structure, DBpedia
Model [235] semantic regularities
KG-BERT [236] BERT, sequence classification Entity descriptions, WN11, FB13, FB15K,
entity/relation names, WN18RR, FB15k-237, UMLS;
sequence order in triples, textual corpus Wikipedia corpuses
KEPLER [237] RoBERTa [238], masked language modeling KG structural information, FB15K, WN18, FB15K-237,
(MLM) entity descriptions, WN18RR; Wikidata5M
textual corpus
BLP [239] BERT, holistic evaluation framework, inductive KG structural information, FB15K-237, WN18RR;
LP, TransE, entity descriptions, Wikidata5M
DistMult, ComplEx, and SimplE textual corpus
StAR [240] RoBERTa/BERT, KG structural information, WN18RR, FB15k-237,
multi-layer perceptron (MLP), entity descriptions, ULMS, NELL-One; Wikipedia
Siamese-style textual encoder textual corpus paragraph
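Several of the PLM-based entries in Table 26 (e.g., KG-BERT [236]) verbalize a triple as a text sequence and score it with a fine-tuned sequence classifier. The sketch below illustrates that idea with the HuggingFace transformers API; the checkpoint name, the verbalization format and the use of a not-yet-fine-tuned classification head are assumptions for illustration only, and in practice the classifier must first be fine-tuned on positive and corrupted triples.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Stock BERT with a fresh 2-way classification head (plausible / implausible).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def triple_score(head_text, relation_text, tail_text):
    """Plausibility score of (h, r, t) taken from the classifier's softmax output.
    The triple is verbalized as two text segments, a simplification of the
    [CLS] h [SEP] r [SEP] t [SEP] packing used by KG-BERT."""
    first = head_text
    second = f"{relation_text} [SEP] {tail_text}"
    inputs = tokenizer(first, second, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Meaningful scores require fine-tuning; this call only shows the interface.
print(triple_score("Barack Obama", "place of birth", "Honolulu"))
```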
Next, we will systematically introduce KGC studies that use third-party data sources; we also list them in Table 26 for a direct presentation.

4.2.2.1. Research inspiration. This direction is inspired by three key observations. Firstly, pre-trained language models (PLMs) such as Word2Vec [75], ELMo [241], GPT [242], and BERT [243] have caused an upsurge in the field of natural language processing (NLP), as they can effectively capture the semantic information in text. They originated in the surprising finding that word representations learned from a large training corpus display semantic regularities in the form of linear vector translations [75], for example, king − man + woman ≈ queen. Such a structure is appealing because it provides an interpretation of the distributional vector space through lexical-semantic analogical inferences. Secondly, under the Open-world Assumption, a missing fact often contains entities out of the KG, e.g., one or more entities are phrases appearing in web text but not yet included in the KG [15]. While relying only on the inner structure information makes it hard to model this scene, third-party textual datasets can provide useful assistance for dealing with these out-of-KG facts. Thirdly, similar to the last point, auxiliary textual information such as entity descriptions can help to learn sparse entities, acting as supplementary information for entities that lack sufficient information in the KG to support learning.
The most striking textual information is the entity description. Very few KGs contain a readily available short description or definition for each of the entities or phrases, such as WordNet
and Freebase, and usually it needs the additional lexical resources textual descriptions must interact with triples for better embed-
to provide textual training. For instance, in a medical dataset with ding. SSP can model the strong correlations between symbolic
many technical words, the Wikipedia pages, dictionary defini- triples and textual descriptions by performing the embedding
tions, or medical descriptions via a site such as ‘medilexicon.com’ process in a semantic subspace.
could be leveraged as lexical resources [236].
Prob-TransE and Prob-TransD [228] jointly learns the repre-
sentation of the entities, relations, and words within a unified
4.2.2.2. Joint alignment model. JointAS [15] jointly embeds entities parameter sharing semantic space. The KG embedding process
and words into the same continuous vector space. Entity names incorporates TransE and TransD (called Prob-TransE and Prob-
and Wikipedia anchors are utilized to align the embeddings of en- TransD) as representative in the framework to handle representa-
tities and words in the same space. Numerous scale experiments tion learning of KGs, while the stage of representation learning of
on Freebase and a Wikipedia/NY Times corpus show that jointly textual relations applies CNN to embed textual relations. A recip-
embedding brings promising improvement in the accuracy of rocal attention mechanism consists of knowledge based attention
predicting facts, compared to separately embedding KGs and text. and the semantics attention (SATT) are proposed to enhance the
Particularly, JointAS enables the prediction of facts containing KGC. The attention mechanism can be simply described as fol-
entities out of the KG, which cannot be handled by previous lows: during the KG embedding process, semantic information
embedding methods. The model is composed of three compo- extracted from text models can be used to help explicit relations
nents: the knowledge model LK , text model LT , and alignment to fit more reasonable entity pairs, similarly, additional logi-
model LA which make the use of entity names LAN and Wikipedia cal knowledge information can be utilized to enhance sentence
anchors LAA , thus the overall objective is to maximize this jointly embedding and reduce the disadvantageous influences of noisy
likelihood loss function: generated in the process of distant supervision. The experiments
use anchor text annotated in articles to align the entities in KG and
L = LK + LT + LA
entities mentions in the vocabulary of the text corpus, and build
where LA could be LAA or LAN or LAN + LAA , and the score function the alignment between relations in KGs and text corpus with the
s(w, v ) = b − 12 (∥w − v∥2 ) of a target word w appearing close to idea of distant supervision. A series of comparative experiments
a context word v (within a context window of a certain length) prove that the joint models (JointE+SATT and JointD+SATT) have
for text model while the score function s(h, r , t) = b − 12 (∥vh + effective performances through trained without strictly aligned
vr − vt ∥2 ) for KG model, in which the b is a bias constant. text corpus. In addition to that, this framework is adaptable and
Although this alignment model goes beyond previous KGE flexible which is open to existing models, for example, the partial
methods and can perform prediction on any candidate facts of TransE and TransD can be replaced by the other KG embedding
between entities/words/phrases, it has drawbacks: using entity methods similar to them such as TransH and TransR.
names severely pollutes the embeddings of words; using JOINER [229] jointly learns text and KG embeddings via regu-
Wikipedia anchors completely relies on the special data source larization. Preserving word–word co-occurrence in a text corpus
and hence the approach cannot be applied to other customer data. and transition relations between entities in a KG, JOINER also can
JointTS [121] takes these above-mentioned issues into consid- use regularization to flexibly control the amount of information
eration, without dependency on anchors, it improves alignment shared between the two data sources in the embedding learning
model LA based on text descriptions of entities by considering both process with significantly less computational overhead.
conditional probability of predicting a word w given entity e ATE [230] carries out KGE using both specific relation mention
and predicting a entity e when there is a word w . This model and entity description encoded with a BiLSTM module. A mutual
learns the embedding vector of an entity not only to fit the struc- attention mechanism between relation mentions and entity de-
tured constraints in KGs but also to be equal to the embedding scriptions is designed to learn more accurate text representation,
vector computed from the text description, hence it can deal to further improve the representation of KG. In the end, the final
with words/phrases beyond entities in KGs. Furthermore, the new entity and relation vectors are obtained by combining the learned
alignment model only relies on the description of entities, so that text representation and the previous traditional translation-based
it can obtain rich information from the text description, thus well representation. This paper also considers the fuzziness of entity
handles the issue of KG sparsity. and relation in the triple, filters out noisy text information to
DKRL [94] is the first work to build entity vectors directly ap- enrich KG embedding accurately.
plying entity description information. The model combines triple aJOINT [162] proposes a new cooperative attention mechanism,
information with entity description information to learn vectors based on this mechanism, a text-enhanced KGE model was pro-
for each entity. The model efficiently learns the semantic em- posed. Specifically, aJOINT enhances KG embeddings through the
bedding of entities and relations relying on the CBOW and CNN text semantic signal: the multi-directional signals between KGE
mechanism and encodes the original structure information of and text representation learning were fully integrated to learn
triples with the use of TransE. Experiments on both KGC and more accurate text representations, so as to further improve the
entity classification tasks verify the validity of the DKPL model in structure representation.
expressing new entities and dealing with zero-shooting cases. But
4.2.2.3. KGC with pre-trained language models. Recently,
it should not be underestimated that DKRL tune-up needs more
pre-trained language models (PLMs) such as ELMo [241],
hyper-parameters along with extra storage space for inner layers’
Word2Vec [75], GPT [242], BERT [243], and XLNet [244] have
parameters.
shown great success in NLP field, they can learn contextualized
Semantic Space Projection (SSP) [227] is a method for KGE with word embedding with large amount of free text data and achieve
text descriptions modifying TransH. SSP jointly learns from the excellent performance in many language understanding tasks
symbolic triples and textual descriptions, which builds interac- [236].
tion between these two information sources, at the same time According to the probable usage of PLMs in KGC tasks, the
textual descriptions are employed to discover semantic relevance related approaches can be roughly divided into two categories
and offer precise semantic embedding. This paper firmly con- [236]: feature-based and fine tuning approaches. Traditional
vinced that triple embedding is always the main procedure and feature-based word embedding methods like Word2Vec and
Glove [92] aim to learn context-independent word vectors. ELMo infer new unobserved triples from existing triples. Excerpts from
generalized traditional word embedding to context-aware word large input graphs are regarded as the simplified and meaningful
embedding, where word polysemy can be properly handled. context of a group of entities in a given domain. Next, based on
Mostly, these word embeddings learned from them are often the context graph, CBOW [75] and Skip-Gram [245] models are
used as initialization vectors during the KGC process. Different used to model KG embedding and perform KGC. In this method,
from the former method, fine-tuning approaches such as GPT the semantic rules between words are preserved to adapt to
and BERT use the pre-trained model structure and parameters as entities and relationships. Satisfactory results have been obtained
the starting point of specific tasks (KGC task we care about). The in some specific field.
pre-trained model learns rich semantic patterns from free text. The well-known BERT [243] is a prominent PLM by pre-
training the bidirectional Transformer encoder [246] through
Lexical Resources Auxiliary Embedding Model (LRAE) [231] ex-
masked language modeling and next sentence prediction. It can
plores methods to provide vector initialization for TransE by using
capture rich linguistic knowledge in pre-trained model weights.
the semantic information of entity description text. LRAE exploits
entity descriptions that are available in WordNet and Freebase As this basis, a number of KGC models try to exploit BERT or its
datasets. The first sentence of a given entity description is first variants for learning knowledge embedding and predicting facts:
selected and then decomposed into a series of word vectors KG-BERT [236] treats entity and relation descriptions of triples
(the first sentence is often most relevant to the described en- as textual sequences inputting to BERT framework, and natu-
tity, which avoids noise interference and large-scale computation rally regards KGC problems as corresponding sequence classi-
from lengthy description text), next all those vectors are averaged fication problems. KG-BERT computes the scoring function of
to form embeddings that represent the overall description seman- serialized triples with a simple classification layer. During the
tics of the entity, where word vectors are computed by Word2vec BERT fine-tuning procedure, they can obtain high-quality triple
[75] and GloVe [92]. These processed descriptive text vectors representations, which contain rich semantic information.
are used as the initialization vectors of the translation model
and are input to TransE for training. LRAE provides initialization KEPLER [237] encodes textual entity descriptions with RoBERTa
vectors for all entities, even including those not present in the [238] as their embedding, and then jointly optimizes the KG
data, thus it alleviates the entity sparse issue. Also, LRAE is very embeddings and language modeling objectives. As a PLM, KEPLER
versatile and can be applied directly to other models whose input can not only integrate factual knowledge into language repre-
is represented by solid vectors. sentation with the supervision from KG, but also produce effec-
tive text-enhanced KG embeddings without additional inference
RLKB [232] modifies DKRL by developed a single-layer proba- overhead compared to other conventional PLMs.
bilistic model that requires fewer parameters, which measures
the probability of each triple and the corresponding entity de- BLP [239] proposes a holistic evaluation framework for entity
scription, obtains contextual embeddings of entities, relations, representations learned via the inductive LP. Consider entities not
and words in the description at the same time by maximizing a seen during training, BLP learns inductive entity representations
logarithmic likelihood loss. based on BERT, and performs LP in combination with four dif-
ferent relational models: TransE, DistMult, ComplEx, and SimplE.
Jointly-Model [233] proposes a novel deep architecture to uti- BLP also provides evidence that the learned entity representations
lize both structural and textual information of entities, which transfer well to other tasks (such as entity classification and
contains three neural models to encode the valuable information information retrieval) without fine-tuning, which demonstrates
from the text description of entity: Bag-of-Words encoder, LSTM that the entity embeddings act as compressed representations
encoder and Attentive LSTM encoder, among which an attentive of the most salient features of an entity. This is additionally
model can select related information as needed, because some important because having generalized vector representations of
of the words in an entity’s description may be useful for the KGs is useful for using them within other tasks.
given relation, but may be useless for other relations. The Jointly-
Model chooses a gating mechanism to integrate representations Structure-augmented text representation (StAR) [240]
of structure and text into a unified architecture. augments the textual encoding paradigm with KGE techniques to
learn KG embeddings for KGC. Following translation-based KGE
Including Text Literals in KGloVe (KGloVe-literals) [234] com- methods, StAR partitions each triple into two asymmetric parts.
bines the text information in entity attributes into KG embed- These parts are then encoded into contextualized representa-
dings, which is a preliminary exploration experiment based on tions by a Siamese-style textual encoder. To avoid combinatorial
KGloVe: it firstly performs KGloVe step to create a graphical co- explosion of textual encoding approaches, e.g., KG-BERT, StAR
occurrence matrix by conducting a personalized PageRank (PPR) employs a scoring module involves both deterministic classifier
on the (weighted) graph; at the same time, it extracts information and spatial measurement for representation and structure learn-
from the DBpedia summary by performing Named Entity Recog- ing respectively, which also enhances structured knowledge by
nition (NER) step, in which the words representing the entity are exploring the spatial characteristics. Moreover, StAR presents a
replaced by the entity itself, and the words surrounding it (and self-adaptive ensemble scheme to further boost the performance
possibly other entities) are contained in the context of the entity; by incorporating triple scores from existing KGE models.
then the text co-occurrence matrix is generated in collaboration
with the list of entities and predicates generated in the KGloVe 4.2.2.4. Discussion on KGC using third-party data source. Based
step. Finally, a merge operation is performed to combine the on the above introduction of KGC using the third-party data
two co-occurrence matrices to fuse the text information into the source (almost all are textual corpus), we give our corresponding
latent feature model. Although the gain of this work is very small, analysis as follows:
it can provide new ideas for the joint learning of attribute text
1. In a narrow sense, this part of KGC studies emphasize the uti-
information and KG embedding.
lize of additional data source outside KGs, but you may be aware
Context Graph Model [235] finds hidden triples by using the that these literals tend to apply PLMs in their works, which takes
observed triples in incomplete graphs. This paper is based on us to think about the application of ‘third party data’ in a broader
the neural language embedding of context graph and applies sense: these PLMs either possess plenty of parameters which have
the similar structure extracted from the relation similarity to trained on large scale language corpus, or provide ready-made
Table 27
Statistics of a part of TKGC technologies.
Model Loss functiona Whether consider time periods Datasets
Temporal order dependence models:
TransE-TAE [252] Lmarg no YAGO2
Diachronic embedding models:
DE-Simple [253] Sampled Lmll No ICEWS14, ICEWS15-05, GDELT15/16
ATiSE [254] Self-adversarial Lns Yes ICEWS14, ICEWS05-15, Wikidata12k, YAGO11k
Temporal Information embedding models
TTransE [255] Lmarg No Wikidata
HyTE [256] Sampled Lmll Yes Wikidata12k, YAGO11k
ConT [257] LBRL No ICEWS14,GDELT
TA-DisMult [258] Sampled Lmll No YAGO-15k, ICEWS14, ICEWS05-15, Wikidata
TNT-ComplEx [259] Instantaneous Lmll Yes ICEWS14, ICEWS15-05, YAGO-15k, Wikidata40k
Dynamic evolution models:
Know-Evolve [260] Conditional intensity function No GDELT, ICEWS14
RE-NET [261] Total classification LCE No ICEWS18, GDELT18
GHN [262] Total classification LCE No ICEWS18, GDELT15/16
TeMP [263] Sampled Lmll No ICEWS14, ICEWS05-15, GDELT
a
As usual, LCE , Lmarg and Lns refers to cross entropy loss, margin-based ranking loss and negative sampling loss respectively. Besides, the Lmll means the multiclass
log-loss, and LBRL refers to the binary regularized logistic loss.
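Table 27 pairs each TKGC model with the loss it optimizes. As a hedged illustration of one common combination, a TTransE-style quadruple score trained with a margin-based ranking loss (Lmarg), the following PyTorch sketch may help; the class name and dimensions are assumptions, and the entity/relation/timestamp counts only echo the ICEWS14 statistics from Table 28.

```python
import torch
import torch.nn as nn

class TTransEStyleScorer(nn.Module):
    """Sketch of a time-aware translational score f(h, r, t, T) = -||h + r + T - t||,
    trained with a margin-based ranking loss (the Lmarg of Table 27)."""

    def __init__(self, n_entities, n_relations, n_timestamps, dim=100):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        self.time = nn.Embedding(n_timestamps, dim)

    def score(self, h, r, t, ts):
        # Higher score = more plausible quadruple.
        return -(self.ent(h) + self.rel(r) + self.time(ts) - self.ent(t)).norm(p=2, dim=-1)

    def margin_loss(self, pos, neg, margin=1.0):
        # pos / neg are (h, r, t, ts) index tensors of true and corrupted quadruples.
        return torch.relu(margin - self.score(*pos) + self.score(*neg)).mean()

model = TTransEStyleScorer(n_entities=7128, n_relations=230, n_timestamps=365)
pos = tuple(torch.tensor([i]) for i in (0, 1, 2, 3))    # a toy true quadruple
neg = tuple(torch.tensor([i]) for i in (0, 1, 5, 3))    # its tail-corrupted version
loss = model.margin_loss(pos, neg)
```

Replacing this margin loss with a (sampled) multiclass log-loss or a self-adversarial negative sampling loss reproduces the other training regimes listed in the table.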

semantically-rich word embeddings, thus when we say a KGC models. Naturally, a summary table is made to sum up all the
work uses a PLM, we would think about it gets assistance from TKGC methods introduced in our overview (Table 27).
the additional language information (from other large language
Temporal Knowledge Graphs (TKGs) and TKGC: For such KGs
corpora, on which the PLM has been fully trained). In other words,
with temporal information, we generally call them TKGs. Natu-
we should not judge a KGC model whether use third-party data
rally, the completion of such KGs is called TKGC, and the original
source merely according to their used datasets, it is especially
triples are redefined as quadruples (h, r , t , T ) where T is the time
important to focus on the details of the model most of the time.
(which can be a timestamp or a time span as [Tstart , Tend ]) [252].
2. As we have discussed in 4.2.2.1, PLMs have an important role With studying time-aware KGC problems, it helps to achieve
in capturing rich semantic information which is helpful to KGC. more accurate completion results, i.e., in LP task, we can dis-
Along with a growing number of assorted PLMs are proposed, in tinguish which triple is real in a given time condition, such as
particular, the models jointly learn language representation from (Barack Obama, President of , USA, 2010) and (Bill Clinton,
both KGs and large language corpus, some PLM models intro- President of , USA, 2010). In addition, some literature also pro-
duce structure data of KGs into the pre-training process through poses time prediction task that predicting the most likely time for
specific KGC tasks to obtain more reasonable language model the given entity and relation by learning the time embeddings vT .
parameters (such as ERNIE [247], CoLAKE [248–251]). In the fu- According to the usage manner of temporal information, we
ture, to explore an efficient joint learning framework derive entity roughly categorize recent TKGC methods into four groups: tem-
representations from KGs and language corpus may be needed, poral order dependence model, diachronic embedding model,
and the key point is how to design the interaction between these temporal information embedding model and dynamic evolu-
two data source, an iterative learning manner, just as the Rule- tion model.
KG embedding series worked, maybe a possible future direction.
What is needed is a method to derive entity representations that 5.1.1. Temporal order dependence models
work well for both common and rare entities. The mentioned temporal order information indicates that un-
der the time condition, some relations may follow a certain order
5. Other KGC technologies timeline, such as BornIn → WorkAt → DiedIn.
TransE-TAE [252] firstly incorporates two kinds of temporal in-
In this part we focus on several other KGC techniques oriented formation for KG completion: (a) temporal order information and
at the special domain, including Temporal Knowledge Graph (b) temporal consistency information. To capture the temporal
Completion (TKGC) in Section 5.1 that concerns time elements in order of relations, they tend to design a temporal evolving matrix
KGs; CommonSense Knowledge Graph Completion MT , with which a prior relation can evolve into subsequent rela-
(CSKGC) which is a relatively new field about commonsense KGs tion (as Fig. 28 shows). Specifically, given two facts having same
studying (see Section 5.2), and Hyper-relational Knowledge Graph head entity (ei , r1 , ej , T1 ) and (ei , r2 , ek , T2 ), it assumes that prior
Completion (HKGC) that pays attention to n-ary relation form relation r1 projected by MT should be near subsequent relation r2 ,
instead of usual 2-nary triples in KGs (see Section 5.3). i.e., r1 MT ≈ r2 . In this way, TransE-TAE allows to separate prior
relation and subsequent relation automatically during training.
5.1. Temporal Knowledge Graph Completion (TKGC) Note that the temporal order information finally is treated as
a regularization term injected into original loss function, being
At present, many facts in KGs are affected by temporal infor- optimized together with KG structural information.
mation, owing to the fact in the real world are not always static
but highly ephemeral such as (Obama, Presidentof , USA) is true 5.1.2. Diachronic embedding models
only during a certain time segment. Intuitively, temporal aspects This kind of models often design a mapping function from
of facts should play an important role when we perform KGC time scalar to entity or relation embedding, input both time and
[252]. In this section, we briefly introduce some famous TKGC entity/relation into a specific diachronic function framework to
get time-aware entity/relation embeddings, which can be directly combined with the existing KGE models.

Fig. 28. Simple illustration of the Temporal Evolving Matrix MT in the time-aware embedding (TAE) space [252].

DE-SimplE [253] extends the previous static model SimplE with a diachronic entity embedding function (DEEMB), which provides the characteristics of an entity at any time point; its input is an entity and a timestamp, while its output is the entity's hidden representation at that time step. This embedding method is called diachronic embedding (DE). Any static KGE method can be extended to a corresponding TKGE (Temporal KGE) model by using DEEMB as follows:

z_v^T[n] = a_v[n] σ(w_v[n] T + b_v[n])  if 1 ⩽ n ⩽ γd,   z_v^T[n] = a_v[n]  if γd < n ⩽ d

where z_v^T[n] represents the nth element of the d-dimensional entity vector, which is calculated in two parts: the first part captures the temporal characteristics of the entity, and the function adopts sin() to learn a set of parameters a, w, b for each entity; the second part captures the static characteristics of the entity, i.e., it keeps the original entity embedding unchanged. In other words, DE-SimplE can learn how to switch entity time-series features on and off at different time points with the use of sin(), so as to make accurate predictions at any time. At the same time, by combining SimplE [44] (a static KGE model) with DE, DE-SimplE achieves full expressiveness (an important standard for measuring the quality of a KGE model, proposed in SimplE).

ATiSE [254] introduces additive time series decomposition on this basis. ATiSE considers the evolution of entity and relation representations to be random because the entity characteristics at a certain time are not completely determined by the past information; thus it maps each entity and relation into a multi-dimensional Gaussian distribution, where the mean vector of an entity at a certain time step represents the current expectation, and the covariance represents the uncertainty over time (a constant diagonal matrix is used to improve efficiency). For the problem that DE-SimplE only considers time points, ATiSE extends to time spans, which means a triple whose time step lies between the begin time point and the end time point is regarded as a positive triple. The diachronic embedding of an entity e_i at the current time step T is as follows:

e_{i,T} = e_i + α_{e,i} w_{e,i} T + β_{e,i} sin(2π w′_{e,i} T) + N(0, Σ_{e,i})

The entity embedding calculated by the above formula is regarded as the mean vector ē_{s,T} of the multi-dimensional Gaussian distribution P_{s,T} ∼ N(ē_{s,T}, Σ_s) of the corresponding entity. Similar to DE-SimplE, ATiSE can also extend any traditional static KGC model to a TKGC model, but it cannot give full play to the ability of time expression.

5.1.3. Temporal information embedding models
Temporal information embedding models introduce temporal information into a specific traditional KGC baseline, such as a translation model or a tensor decomposition model, for learning time-aware embeddings and training a time-aware scoring function.

Concerning the earlier work TransE-TAE [252] (which learns non-explicit time-aware embeddings, as it did not directly introduce temporal information into embedding learning), TTransE [255] and HyTE [256] integrate time embeddings into the distance-based score function following the ideas of TransE and TransH. The former explores three methods of introducing the time factor into basic TransE; among them, the vector-based TTransE achieves excellent results, directly modeling the time embedding in the same way as entity or relation embeddings, i.e., for a quadruple (h, r, t, T), score = −∥v_h + v_r + v_T − v_t∥. The latter, HyTE, applies a time-aware KG embedding method based on time hyperplanes: after being projected onto the hyperplane of timestamp T, each entity or relation is represented in the following form:

P_T(v_x) = v_x − (w_T^⊤ v_x) w_T

where w_T is the normal vector of the current time hyperplane. HyTE then defines the score function of a quadruple (h, r, t, T) as

f_T(h, r, t) = ∥P_T(v_h) + P_T(v_r) − P_T(v_t)∥

which follows the translational characteristics.

ConT [257] is an extension of Tucker [37] defining a core tensor w for each time stamp. TA-DistMult [258] combines tokenized time and relation into a predicate sequence which is input into an RNN to learn a temporal relation representation, while TNTComplEx [259] adopts an unfolding of 4-way tensor modes.

5.1.4. Dynamic evolution models
Dynamic evolution models dynamically learn entity embeddings along with time steps. This kind of method, like Know-Evolve [260], calls the phenomenon that entities and relations change dynamically over time knowledge evolution, and it models a nonlinear evolution representation of entities under this scene. Know-Evolve is used in the reasoning of TKGs; it designs a novel RNN structure for dynamic evolution representation learning of entities and sets a specific loss function based on a relational score function, like RESCAL [13]. Besides, recent works use neighborhood aggregation information to predict the probability of event occurrence, including RE-NET [261], GHN [262] and TeMP [263], via Graph Convolution Networks (GCN) [82].

5.1.5. Performance comparison of TKGC models
Datasets: There are some datasets specialized for the TKGC task, and several TKGC datasets are shown in Table 28. We make a brief introduction about them as follows:

ICEWS The Integrated Conflict Early Warning System (ICEWS) [264] is a natural episodic dataset recording dyadic events between different countries, which was first created and used in [265], where a semantic tensor is generated by extracting consecutive events that last until the last timestamp. After that, ICEWS14, ICEWS05-15 and ICEWS18 are subsets of ICEWS, corresponding to the facts of 2014, 2005–2015 and 2018 respectively. These three datasets are filtered by only selecting the most frequent entities in the graph, and all the time labels inside them are time points.

GDELT The Global Database of Events, Language and Tone (GDELT) [264] monitors the world's news media in broadcast, print, and web formats from all over the world, daily since January 1, 1979. As a large episodic dataset, its data format is similar to ICEWS, i.e., (es, ep, eo, et) quadruples; these events also usually are aggregated into an episodic tensor. GDELT15-16 and GDELT18 are subsets of GDELT.

YAGO15K is created firstly using FB15K [11] by aligning entities from FB15K to YAGO [266] with SAMEAS relations contained in a
Table 28
Statistics of several Temporal Knowledge Graph datasets.
Dataset Entity Relation Fact Timestamps
#Train #Valid #Test
Time Slot-based dataset
Wikidata [258] 11 134 95 121 442 14 374 14 283 1726 (1 year)
Wikidata12K [256] 12 554 24 32.5k 4k 4k 232 (1 year)
YAGO11K [256] 10 623 10 16.4k 2k 2k 189 (1 year)
YAGO15K [258] 15 403 34 110 441 13 815 13 800 198 (1 year)
Fact-based dataset
ICEWS 14 [253] 7128 230 72 826 8941 8963 365 (1 day)
ICEWS 18 [253] 23 033 256 373 018 45 995 49 545 304 (1 day)
ICEWS 05-15 [253] 10 488 251 386 962 46 275 46 092 4017 (1 day)
GDELT(15-16) [253] 500 20 2 735 685 341 961 341 961 366 (1 day)
GDELT(18) [261] 7691 240 1,734,399 238,765 305,241 2751 (15 min)
Table 29
Evaluation results of TKGC on ICEWS14, ICEWS05-15 and GDELT datasets. Best results are in bold.
ICEWS14 ICEWS05-15 GDELT
MRR Hits@1 Hit@3 Hit@10 MRR Hit@1 Hit@3 Hit@10 MRR Hit@1 Hit@3 Hit@10
TransE [11] 0.280 0.094 – 0.637 0.294 0.090 – 0.663 0.113 0.0 0.158 0.312
DisMult [42] 0.439 0.323 – 0.672 0.456 0.337 – 0.691 0.196 0.117 0.208 0.348
SimplE [44] 0.458 0.341 0.516 0.687 0.478 0.359 0.539 0.708 0.206 0.124 0.220 0.366
ComplEx [43] 0.456 0.343 0.516 0.680 0.483 0.366 0.543 0.710 0.226 0.142 0.242 0.390
TTransE [255] 0.255 0.074 – 0.601 0.271 0.084 – 0.616 0.115 0.0 0.160 0.318
HyTE [256] 0.297 0.108 0.416 0.655 0.316 0.116 0.445 0.681 0.118 0.0 0.165 0.326
TA-DistMult [258] 0.477 0.363 – 0.686 0.474 0.346 – 0.728 0.206 0.124 0.219 0.365
ConT [257] 0.185 0.117 0.205 0.315 0.163 0.105 0.189 0.272 0.144 0.080 0.156 0.265
DE-TransE [253] 0.326 0.124 0.467 0.686 0.314 0.108 0.453 0.685 0.126 0.0 0.181 0.350
DE-DistMult [253] 0.501 0.392 0.569 0.708 0.484 0.366 0.546 0.718 0.213 0.130 0.228 0.376
DE-SimplE [253] 0.526 0.418 0.592 0.725 0.513 0.392 0.578 0.748 0.230 0.141 0.248 0.403
ATiSE [254] 0.545 0.423 0.632 0.757 0.533 0.394 0.623 0.803 – – – –
TeMP-GRU [263] 0.601 0.478 0.681 0.828 0.691 0.566 0.782 0.917 0.275 0.191 0.297 0.437
TeMP-SA [263] 0.607 0.484 0.684 0.840 0.680 0.553 0.769 0.913 0.232 0.152 0.245 0.377
TNTComplEx [259] 0.620 0.520 0.660 0.760 0.670 0.590 0.710 0.810 – – – –
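Since Table 29 reports MRR and Hits@k, a short sketch of how these ranking metrics are computed from the (filtered) rank of the correct entity for each test query may be useful; the toy ranks below are illustrative only.

```python
import numpy as np

def ranking_metrics(ranks, ks=(1, 3, 10)):
    """MRR and Hits@k from the (filtered) rank of each test fact's true entity,
    i.e. the position of the correct answer among all scored candidates."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        metrics[f"Hits@{k}"] = float(np.mean(ranks <= k))
    return metrics

# Toy example: ranks assigned to the correct entity for five test queries.
print(ranking_metrics([1, 2, 4, 1, 30]))
# {'MRR': 0.557, 'Hits@1': 0.4, 'Hits@3': 0.6, 'Hits@10': 0.8}
```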
YAGO15K is created firstly using FB15K [11] by aligning entities from FB15K to YAGO [266] with SAMEAS relations contained in a YAGO dump, and keeping all facts involving those entities. Then, this collection of facts is augmented with time information from the yagoDateFacts dump. Contrary to the ICEWS datasets, YAGO15K does contain temporal modifiers, namely ‘occursSince’ and ‘occursUntil’ [258]. What is more, all facts in YAGO15K maintain time information at the same level of granularity as one can find in the original dumps these datasets come from, which is different from [255].

YAGO11k [256] is a rich subgraph of YAGO3 [267], including the top 10 most frequent temporally rich relations of YAGO3. By recursively removing edges containing entities with only a single mention in the subgraph, YAGO11k handles sparsity effectively and ensures healthy connectivity within the graph.

Wikidata Similar to YAGO11k, Wikidata contains time interval information. As a subset of Wikidata, Wikidata12k is extracted from a preprocessed dataset of Wikidata proposed by [255]; its creation procedure follows the process described for YAGO11k. By distilling out the subgraph with time mentions for both start and end, it ensures that no entity has only a single edge connected to it [256], but it is almost double the size of YAGO11k.

Performance Results Comparison: We report some published experimental results of TKGC methods in Table 29, from which we find that TeMP-SA and TeMP-GRU achieve satisfying results on all three datasets across all evaluated metrics. Compared to TNTComplEx [259], the most recent work that achieved the best performance on the ICEWS datasets before TeMP, they are 8.0% and 10.7% higher on the Hits@10 evaluation. Additionally, TeMP also achieves a 3.7% improvement on GDELT compared with DE, the prior state-of-the-art on that dataset, while the results of the ATiSE and TNTComplEx methods on the GDELT dataset are not available.

5.1.6. Analysis of TKGC models

Inspired by the excellent performance of translation models and tensor factorization models in traditional KGC, temporal knowledge graph completion (TKGC) mainly introduces temporal embeddings into the entity or relation embeddings based on the above two kinds of KGC ideas. Recently, with the wide application of GCNs to heterogeneous graphs, more and more TKGC methods adopt the idea of a ‘‘subgraph of a TKG’’ [261], which we call a temporal subgraph: they aggregate the neighborhood information at each time step, and finally cooperate with a sequence model such as an RNN to complete the time migration between subgraphs. Future methods may continue to explore the construction of temporal subgraphs and pay attention to the relevance between temporal subgraphs. In addition, more attention may be paid to the static information that exists in TKGs, so as to promote the integration of TKGC and traditional KGC methods.

5.2. CommonSense Knowledge Graph Completion (CSKGC)

CommonSense knowledge is also referred to as background knowledge [268]; it is a potentially important asset towards building versatile real-world AI applications, such as visual understanding for describing images (e.g., [269–271]) and recommendation systems or question answering (e.g., [272–274]). Thereby a novel kind of KG involving CommonSense knowledge has emerged, CommonSense knowledge graphs (CSKGs), and we are naturally interested in the completion of CSKGs; here we give a presentation of a series of CommonSense Knowledge Graph Completion (CSKGC) techniques. The corresponding summary of the described CSKGC methods is shown in Table 30.
Table 30
Statistics of recent popular CommonSense KGC technologies.
Model Technology Information Datasets
Language Auxiliary CSKGC Models with Pre-trained Language Models:
NAM [64] Neural Association Model, Large unstructured texts CN14
neural networks:
DNN and relation-modulated neural nets (RMNN),
probabilistic reasoning,
PLMs: skip-gram
DNN-Bilinear DNN, Text phrases ConceptNet 100K
[275] Bilinear architecture,
averaging the word embeddings (DNN AVG, Bilinear AVG),
max pooling of LSTM (DNN LSTM, Bilinear LSTM),
PLMs: skip-gram
CSKGC-G [268] DNN AVG in [275], Text phrases ConceptNet
attention pooling of DNN LSTM, 100K,
bilinear function, JaKB
defining CSKG generation task
COMET [276] Automatic CSKG generation, CSKG structure and relations ConceptNet,
adaptable framework, ATOMIC
GPT,
multiple transformer blocks of multi-headed attention
MCC [277] End-to-end framework, Graph structure of local ConceptNet,
encoder: GCNs + fine-tuned BERT, neighborhood, ATOMIC
decoder: ConvTransE, semantic context of nodes in KGs
A progressive masking strategy
CSKGC with Logical Rules:
UKGEs [278] Uncertain KGE, Structural and uncertainty ConceptNet,
probabilistic soft logic information CN15k,
of relation facts NL27k,
PPI5k
DICE [279] ILP (Integer linear programming), CommonSense knowledge ConceptNet,
weighted soft constraints, statements (four dimensions), Tuple-KB,
the theory of reduction costs of a relaxed LP, taxonomic hierarchy related concepts Qasimodo
joint reasoning over CommonSense,
knowledge statements sets

Table 31
ConceptNet tuples with left term ‘‘soak in hotspring’’; the final column is the confidence score [275].
Relation Right term conf.
MOTIVATEDBYGOAL Relax 3.3
USEDFOR Relaxation 2.6
MOTIVATEDBYGOAL Your muscle be sore 2.3
HASPREREQUISITE Go to spa 2
CAUSES Get pruny skin 1.6
HASPREREQUISITE Change into swim suit 1.6

CommonSense knowledge graphs (CSKGs) usually provide a confidence score along with every relation fact, representing the likelihood of the relation fact to be true. Some famous uncertain KGs include ProBase [280], ConceptNet [281] and NELL [282], among which ConceptNet [281] is a multilingual uncertain KG for CommonSense knowledge that is collected via crowdsourcing [278], and the confidence scores in ConceptNet mainly come from the co-occurrence frequency of the labels in crowdsourced task results. The curated commonsense resource ConceptNet contains tuples consisting of a left term, a relation, and a right term; some examples of this form are shown in Table 31. The relations come from a fixed set. While terms in Freebase tuples are entities, ConceptNet terms can be arbitrary commonsense phrases. Normally, for the examples in Table 31, an NLP application may wish to query this kind of commonsense phrase collection for information about ‘‘soaking in a hotspring’’, but may use distinct words from those contained in the existing tuples.

Data format: Facts in CSKGs are often represented as RDF-style triples (h, r, t), where h and t are arbitrary words or phrases, and r ∈ R is a relation between h and t [268]. Taking the triple (go to restaurant, subevent, order food) as an instance, it expresses a piece of commonsense: ‘‘order food’’ happens as a sub-event of ‘‘go to restaurant’’.

5.2.1. Commonsense Knowledge Graph Completion

As the existing CommonSense knowledge in CSKGs is far from sufficient and thorough, it is natural to introduce the CommonSense Knowledge Graph Completion (CSKGC) task. While there has been a substantial amount of work on KGC for conventional KGs such as Freebase [277], relatively little work exists on KGC for CSKGs such as ATOMIC [283] and ConceptNet [284].

The work in [285] enters into a meaningful discussion of the rationality and possibility of KGC models for mining CommonSense knowledge (CSKM), through a series of detailed analyses of multiple KGC baseline models: the Factorized model, the Prototypical model, and the DNN model, with the Bilinear model of [275] as the compared model. They propose a novel metric to re-evaluate the aforementioned KGC models and analyze how candidate triples should be split for the mining task. In a word, the abundant analysis of the potential correlation between existing KGC models and the CSKGC task, together with several first steps towards a more principled evaluation methodology, provides helpful experience for further exploration of CSKM.

More specifically, based on their distinct goals, many researchers identify unique challenges in CSKGC and further investigate effective methods to address these challenges. Here, we introduce the current CSKGC methods according to their used technologies in two main categories: Language Auxiliary CSKGC Models with Pre-trained Language Models and CSKGC with Logical Rules, as shown in Table 30.
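Before turning to concrete models, the scored-tuple data format described above (cf. Table 31) can be illustrated with a small sketch; the class name and the sample tuples below are illustrative assumptions, not part of any released dataset.

```python
from typing import NamedTuple

class CSKGFact(NamedTuple):
    head: str          # arbitrary commonsense phrase
    relation: str      # relation drawn from a fixed set
    tail: str          # arbitrary commonsense phrase
    confidence: float  # crowdsourcing-derived plausibility score

facts = [
    CSKGFact("soak in hotspring", "MotivatedByGoal", "relax", 3.3),
    CSKGFact("go to restaurant", "Subevent", "order food", 2.0),
]

# A toy query: retrieve facts whose head mentions a given phrase.
print([f for f in facts if "hotspring" in f.head])
```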
5.2.2. Language auxiliary CSKGC models with pre-trained language models

Neural association model (NAM) [64] applies a deep learning framework to model the association between any two events in a domain by computing a conditional probability between them. The work conducts two case studies to investigate two NAM structures, namely deep neural networks (DNN) and relation-modulated neural nets (RMNN). In the experiments, this work evaluates the CSKGC task on the ConceptNet CSKG, and the results show that both DNNs and RMNNs perform equally well and can significantly outperform the conventional methods available for these reasoning tasks. Moreover, to further prove the effectiveness of the proposed models when reasoning over new CommonSense knowledge, the work applies NAMs to solve challenging Winograd Schema (WS) problems, and the subsequent experimental results show that NAMs have the potential for commonsense reasoning.
DNN-Bilinear Model [275] attempts to use a bilinear model and a DNN for the CSKGC study. Specifically, it designs two strategies for both structures to model commonsense phrases: directly averaging the word embeddings (called DNN AVG and Bilinear AVG) or using max pooling over an LSTM (called DNN LSTM and Bilinear LSTM).
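As a rough illustration of the AVG composition strategy feeding a bilinear scorer (formalized below), consider the following sketch; the embedding lookup table, dimensions, and random initialization are illustrative assumptions rather than the authors' released code, and the nonlinear projection used in the formal definition is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8
word_emb = {w: rng.normal(size=dim) for w in
            "soak in hotspring relax play game know rule".split()}
M_rel = {"MotivatedByGoal": rng.normal(size=(dim, dim)),
         "HasPrerequisite": rng.normal(size=(dim, dim))}

def phrase_avg(phrase):
    """AVG strategy: a phrase vector is the mean of its word embeddings."""
    return np.mean([word_emb[w] for w in phrase.split()], axis=0)

def bilinear_avg_score(head, relation, tail):
    """Bilinear AVG: score(h, r, t) = u_h^T M_r u_t."""
    u_h, u_t = phrase_avg(head), phrase_avg(tail)
    return float(u_h @ M_rel[relation] @ u_t)

print(bilinear_avg_score("soak in hotspring", "MotivatedByGoal", "relax"))
```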
Formally, they define the score functions of a triplet (h, r, t) for the bilinear and DNN models, respectively, as follows:

$\mathrm{score}_{bilinear}(h, r, t) = u_h^{\top} M_r u_t$
$u_x = a(W^{(B)} v_x + b^{(B)}), \quad x = h, t$

and:

$\mathrm{score}_{DNN}(h, r, t) = W^{(D_2)} \big( a(W^{(D_1)} v_{in} + b^{(D_1)}) \big) + b^{(D_2)}$
$v_{in} = \mathrm{concat}(v_{ht}, v_r) \in \mathbb{R}^{d_e + d_r}$

where $v_h, v_t \in \mathbb{R}^{d_e}$ are the vectors representing h and t, and $v_r$ is the relation embedding. $M_r \in \mathbb{R}^{d_r \times d_r}$ is the parameter matrix for relation r, and $v_{ht} \in \mathbb{R}^{d_e}$ is a phrase representation obtained by concatenating h and t. The function $a(\cdot)$ is a nonlinear activation function, and $W^{(B)}, W^{(D_x)}$ and $b^{(B)}, b^{(D_x)}$ (x = 1, 2) are the weight matrices and bias terms of the bilinear model and the DNN model, respectively.

Completion and Generation Model (CSKGC-G) [268] further improves [275] by replacing the max pooling with attention pooling in the DNN LSTM structure and adding a bilinear function; the phrase embedding of (h, r, t) is formulated as:

$\mathrm{hidden}_x^{j} = \mathrm{BiLSTM}(v_x^{j}, \mathrm{hidden}_x^{j-1}), \quad x = h, t$
$v_x = \sum_{j=1}^{J} \frac{\exp(e_j)}{\sum_{k=1}^{J} \exp(e_k)} \, \mathrm{hidden}_x^{j}, \quad x = h, t$
$e_k = w^{\top} \tanh(W \, \mathrm{hidden}_x^{k}), \quad x = h, t$
$v_{ht} = \mathrm{Bilinear}(v_h, v_t)$
$v_{in} = \mathrm{concat}(v_{ht}, v_r)$

Apart from the commonly used variables, J denotes the word length of phrase h (or t), and w is a linear transformation vector for calculating the attention weights. Besides, $v_x^{j}$ and $\mathrm{hidden}_x^{j}$ are the jth word embedding and the jth hidden state of the LSTM for phrase x (x = h, t). Another highlight of [268] is that it develops a commonsense knowledge generation model which shares information with the CSKGC part; its framework is shown in Fig. 29. This devised model jointly learns the completion and generation tasks, which improves the completion task because triples generated by the generation model can be used as additional training data for the completion model. In this way, this work allows increasing both the node size and the connectivity of CSKGs.

Fig. 29. Architecture of the CSKGC-G Model. The completion part estimates the score of (h = ‘play game’; r = ‘HasPrerequisite (HP)’; t = ‘know rule’), and the generation module generates t from (h; r) and h from (t; r′). r′: HP denotes the reverse direction of ‘HasPrerequisite’ [268].

COMET [276] is an automatic generation model for CSKGs. This adaptation framework constructs a CSKG by using a seed set of existing knowledge tuples, which contain rich information about KG structure and relations, and operates a large-scale transformer language model (GPT in [242]) with multiple transformer blocks of multi-headed attention over these prepared seed sets to produce CommonSense knowledge tuples.

Machine Commonsense Completion (MCC) [277] performs CSKGC by utilizing the structure and semantic context of nodes in KGs. CSKGs have significantly sparser and orders-of-magnitude larger graph structures compared with conventional KGs; therefore they pose a major challenge for general KGC approaches that assume densely connected graphs over a relatively smaller set of nodes. In this work, a joint model is presented with Graph Convolutional Networks (GCNs) [78] and a fine-tuned BERT [243] model as the encoder side to learn information from the graph structure. ConvTransE [71] is chosen as the decoder side to compute a tuple's score. As for the encoder process, the GCN model first integrates the representation of a node according to its local neighborhood via synthetic semantic similarity links, and fine-tuned BERT is then used to transfer learning from text to KGs. A progressive masking strategy further ensures that the model appropriately utilizes information from both sources.

5.2.3. CSKGC with logical rules

Uncertain KGEs (UKGEs) [278] explores uncertain KGE approaches, including CSKGC research. Preserving both the structural and the uncertainty information of triples in the embedding space, UKGEs learns embeddings according to the confidence scores of uncertain relation facts and further applies probabilistic soft logic to infer confidence scores for unseen relation facts during training.

Diverse CommonSense Knowledge (DICE) [279] is a multi-faceted method with weighted soft constraints to couple the inference over concepts (that are related in a taxonomic hierarchy) for deriving refined and expressive CommonSense knowledge. To capture the refined semantics of noisy CommonSense knowledge statements, they consider four dimensions of concept properties: plausibility, typicality, remarkability and saliency, and model the coupling of these dimensions by a soft constraint system, which expresses inter-dependencies between the four CommonSense knowledge dimensions with three kinds of logical constraints: Concept-dimension dependencies, Parent–child dependencies and Sibling dependencies, enabling effective and scalable joint reasoning over noisy candidate statements.
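To give a flavor of the soft-logic reasoning used by the rule-based methods above, the following sketch applies the Łukasiewicz t-norm (the relaxation commonly used in probabilistic soft logic, as in UKGE [278]) to derive a confidence for an unseen fact from a rule body; the rule, the example facts and their scores are illustrative assumptions.

```python
def lukasiewicz_and(*truth_values):
    """Lukasiewicz t-norm used by probabilistic soft logic:
    I(A AND B) = max(0, I(A) + I(B) - 1)."""
    total = sum(truth_values)
    return max(0.0, total - (len(truth_values) - 1))

# Observed (soft) confidences for two facts.
conf_a = 0.9   # (city, locatedIn, country)
conf_b = 0.8   # (country, locatedIn, continent)

# Rule: locatedIn(x, y) AND locatedIn(y, z) -> locatedIn(x, z).
# Under soft logic, the body's truth value lower-bounds the head's score.
inferred_conf = lukasiewicz_and(conf_a, conf_b)
print(inferred_conf)  # ~0.7
```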
Table 32
Statistic of CommonSense Knowledge Graph datasets.
Dataset Entity Relation Fact (#Train / #Val1 / #Val2 / #Test)
ATOMIC 256,570 9 610,536 – – –
CN14 159,135 14 200,198 5000 – 10,000
JaKB 18,119 7 192,714 13,778 – 13,778
CN-100K 78,088 34 100,000 1200 1200 2400
CN15k 15,000 36 241,158
NL27 27,221 404 175,412
PPI5k 4999 7 271,666

Table 33
Summary about CSKGC models.
Model Completion Generation
NAMs [64] √ √
DNN-Bilinear [275] √ √
CKGC-G [268] √ √
COMET [276] √
MCC [277] √
UKG embedding [278] √
Dice [279]
Note that the above-mentioned reasoning is then cast into an integer linear program (ILP), and they also leverage the theory of reduced costs of a relaxed LP to compute informative rankings. After experiments on large CommonSense knowledge collections, ConceptNet, TupleKB, and Quasimodo, as well as human judgments, it finally results in a publicly available CSKG containing more than 1.6M statements about 74k concepts.

5.2.4. Performance analysis of CSKGC models

Datasets: We list some CSKG datasets in Table 32 to show their basic data statistics.

ConceptNet As we have introduced before, ConceptNet [284] is a large-scale and multi-lingual CSKG. The evaluation set, which is created from a subset of the whole ConceptNet, consists of data only in English and contains many short phrases including single words [268]. CN14, CN-100K and CN15k are all subsets of ConceptNet.

ConceptNet-100K (CN-100K) [275] contains general commonsense facts about the world. The original version contains the Open Mind Common Sense (OMCS) entries from ConceptNet, whose nodes contain 2.85 words on average. Its dataset splits are shown in Table 32. Following the original splits of the dataset, [277] combines the two provided development sets to create a larger development set, so that the development and test sets consist of 1200 tuples each.

CN14 Liu et al. [64] use the original ConceptNet [286] to construct CN14. When building CN14, they first select all facts in ConceptNet related to 14 typical commonsense relations and then randomly divide the extracted facts into three sets, Train, Dev, and Test. In the end, to create a test set for classification, they randomly switch entities (within the whole vocabulary) of correct triples and obtain a total of 2×#Test triples (half are positive samples and half are negative examples).

CN15k is a subgraph of ConceptNet; it matches the number of nodes of FB15k [11], and contains 15,000 entities and 241,158 uncertain relation facts in English [278].

ATOMIC contains social CommonSense knowledge about day-to-day events [268]. This dataset specifies the effects, requirements, intentions, and attributes of the participants in an event. The average phrase length of nodes (4.40 words) is slightly higher than that of CN-100K, and there may be multiple targets for a given source entity and source relation. Tuples in this graph may also contain no targets when the relation type does not need to be annotated. The original dataset segmentation is created to make the seed entity sets between the training and evaluation segmentations mutually exclusive. Because the CSKGC task requires entities to be viewed at least once, [268] creates a new random 80-10-10 partition of the dataset, with the development and test sets consisting of 87k tuples.

NL27k is extracted from NELL [282], an uncertain KG obtained from web-page reading.

PPI5k [287] labels the interactions between proteins with the probabilities of occurrence. PPI5k is a subset of STRING; it is a denser graph with fewer entities but more relation facts than NL27k and CN15k.

Ja-KB The open-domain Ja-KB (Japanese CommonSense knowledge) is created using crowdsourcing as in Open Mind Common Sense (OMCS) [288] to evaluate the robustness of CSKGC models in terms of language and long phrases [268]. By limiting the relation types to those often containing nouns and verbs, Ja-KB has fewer relation labels than ConceptNet. The relation set of Ja-KB includes Causes, MotivatedBy, Subevent, HasPrerequisite, ObstructedBy, Antonym, and Synonym, and its average phrase length is longer than in ConceptNet. Since data annotated by crowd workers is usually noisy, the Ja-KB creation procedure performed a two-step data collection process to eliminate noisy data: a data creating step and an evaluation step.

TupleKB is extracted from web sources with a focus on the science domain, with comparably short and canonicalized triples [279].

Qasimodo is a web-extracted general-world CommonSense knowledge collection with a focus on saliency [279].

Analysis about CSKGC models: Here we give a brief analysis of CSKGC models. The generation models can produce appreciable new explicit knowledge from originally diverse and noisy commonsense phrase collections; in general, they rely on language corpora or pre-trained language models to generalize commonsense language representations, and their target is to add novel nodes and edges to the seed CSKGs. Generative models such as COMET can generate novel knowledge that approaches human performance. This research pointed out a plausible alternative to extractive methods, namely using generative commonsense models for automatic CSKGC. By comparison, the CSKGC models tend to search for potentially valid edges in existing CSKGs. An intuitive table is shown in Table 33, which roughly sums up completion and generation models. However, the main finding in [277] about applying the generative model to the completion task is that such a generative model cannot easily be re-purposed to rank tuples for KGC; the experimental results serving as evidence are shown in Table 34. This may be because of the problems associated with using log-likelihood as an estimate for the truth of a tuple. Nevertheless, generative models such as COMET have several merits. These models possess faster training speed, require lower storage memory, and are naturally transductive. Furthermore, the work in [277] indicates that reasoning models that rely on KGs could favor a discriminative approach towards CSKG induction since that would make the graph denser without adding new nodes.
Table 34
CommonSense KGC (CSKGC) evaluation on CN-100K and ATOMIC with subgraph sampling [277]. The baselines are presented in the top of the table, the middle part
shows the KGC results of COMET and the bottom half are the model implementations in [277].
Model CN-100K ATOMIC
MRR Hits@1 Hits@3 Hits@10 MRR Hits@1 Hits@3 Hits@10
DISTMULT 8.97 4.51 9.76 17.44 12.39 9.24 15.18 18.3
COMPLEX 11.4 7.42 12.45 19.01 14.24 13.27 14.13 15.96
CONVE 20.88 13.97 22.91 34.02 10.07 8.24 10.29 13.37
CONVTRANSE 18.68 7.87 23.87 38.95 12.94 12.92 12.95 12.98
COMET-NORMALIZED 6.07 0.08 2.92 21.17 3.36 0 2.15 15.75
COMET-TOTAL 6.21 0 0 24 4.91 0 2.4 21.6
BERT + CONVTRANSE 49.56 38.12 55.5 71.54 12.33 10.21 12.78 16.2
GCN + CONVTRANSE 29.8 21.25 33.04 47.5 13.12 10.7 13.74 17.68
SIM + GCN + CONVTRANSE 30.03 21.33 33.46 46.75 13.88 11.5 14.44 18.38
GCN + BERT + CONVTRANSE 50.38 38.79 56.46 72.96 10.8 9.04 11.21 14.1
SIM + GCN + BERT + CONVTRANSE 51.11 39.42 59.58 73.59 10.33 8.41 10.79 13.86

Saito et al. [268] exhibit a shared model that may help promote the CSKGC effect by jointly learning with a generation module; in this case, the generation module can generate augmented reasonable knowledge to further improve CSKGC. In other words, the loss function of the generation module acts as a good constraint for the CSKGC model.

5.2.5. Challenges of CSKGC

As a kind of novel KG, CSKGs have a series of inherently challenging features:

1. Resource Scarcity in CSKGs: Although researchers have developed lots of techniques for acquiring CSKGs from raw text with patterns [289], it has been pointed out that some sorts of knowledge are rarely expressed explicitly in textual corpora [290]. Therefore, researchers have developed curated CSKG resources by manual annotation [281]. Although manually created knowledge has high precision, these resources mostly suffer from coverage shortage [268].

2. Sparsity of CSKGs: The key challenge in completing CSKGs is the sparsity of the graphs [277]. Different from traditional KGs, CSKGs are composed of nodes represented by non-standardized free-form text, as shown in Fig. 30. For example, the nodes ‘‘prevent dental caries’’ and ‘‘dental caries’’ are conceptually related, but not equivalent, so they are represented as different nodes. This conceptual diversity and graphic expressiveness are essential for expressing commonsense, but it also means that the number of nodes is several orders of magnitude larger, and the graphs are much sparser than traditional KGs. For example, encyclopedic KGs like FB15K-237 [32] have 100x the density of ConceptNet and ATOMIC.

Fig. 30. Subgraph from ConceptNet illustrating the semantic diversity of nodes, which are represented by non-standardized free-form text. Dashed blue lines represent potential edges to be added to the graph [277].

3. Difficulty to Model Uncertain KGs using KGE Models: It is a difficult problem to use ordinary KG embedding to capture uncertain information such as CommonSense knowledge facts [278]. This is a very important task for several reasons. Firstly, compared with deterministic KG embedding, uncertain KG embedding needs to encode additional confidence information to keep the uncertainty characteristic. Secondly, the existing KG embedding models cannot capture the subtle uncertainty of unseen relational facts, because they assume that all unseen relational facts are false beliefs and minimize the credibility measures of such relational facts. For uncertain KG embedding learning, one of the main challenges is to correctly estimate the uncertainty of unseen relational facts.

4. Irrationality of Structural Information in CSKGs: Another limitation of existing CommonSense knowledge datasets is that they organize statements in a flat, one-dimensional way, and only rank them according to the confidence score [279]. This not only lacks information about whether an attribute is applicable to all or only some instances of a concept, but is also short of awareness of which attributes are typical and which are prominent from a human point of view. Take an example from [279]: the idea that hyenas drink milk (when they are young, all mammals drink milk) is true, but not typical. It is typical for hyenas to eat meat, but it is not obvious that humans will spontaneously name it as a major feature of hyenas. In contrast, the carcasses eaten by hyenas are remarkable because this distinguishes hyenas from other African carnivores (such as lions or leopards), which many people would list as a prominent asset. Previous work on CommonSense knowledge has omitted these reference and expression dimensions.
Table 35
Statistics of recent popular hyper-relational KGC technologies.
Model Hyper-relational fact representation Information Technology Task
(for n-ary fact(h, r , t) with (ki , vi ))
m-TransH [291] {(rh , rt , k1 , . . . , kn ), n-ary key–value pairs A direct modeling framework Predict entities
(h, t , v1 , . . . , vn )} for embedding multifold relations,
fact representation recovering,
TransH
RAE [112] {(rh , rt , k1 , . . . , kn ), n-ary key–value pairs m-TransH, Predict entities
(h, t , v1 , . . . , vn )} relatedness between entities,
instance reconstruction
NaLP [292] {rh : h, rt : t , ki : vi }, n-ary key–value pairs CNN, Predict entities
i = 1, . . . , n key–value pairs relatedness predict relations
HINGE [8] (h, r , t), Triple data, CNN, Predict entities
{rh : h, rt : t , ki : vi }, i = 1, . . . , n n-ary key–value pairs triple relatedness, key–value pairs relatedness predict relations

5.3. Hyper-Relational Knowledge Graph Completion (HKGC)

Although existing embedding techniques have obtained promising successes across most common KGs, they are all developed based on the assumption of binary relations, i.e., knowledge data instances each involving two entities (such as ‘‘Beijing is the capital of China’’); such binary relational triples are in the form of (head entity, relation, tail entity). However, a large portion of the knowledge data comes from non-binary relations (such as ‘‘Benedict Cumberbatch played Alan Turing in the movie The Imitation Game’’) [291], although these n-ary relational facts are usually decomposed into multiple triples via introducing virtual entities, such as the Compound Value Type (CVT) entities in Freebase. For example, in Freebase [5], more than 1/3 of the entities participate in non-binary relations. Noting that some studies [8] have indicated that the triple-based representation of a KG often oversimplifies the complex nature of the data stored in the KG, in particular for hyper-relational data, this calls for a necessary investigation of embedding techniques for KGs containing n-ary relational data (HKGs), which we call Hyper-Relational Knowledge Graph Completion (HKGC). Table 35 is an overview of several HKGC technologies introduced in this paper.

5.3.1. Definition of facts in hyper-relational KGs

Formally, a commonly used representation scheme for an HKG's facts transforms a hyper-relational fact into an n-ary representation [112,291,292], i.e., a set of key–value (relation–entity) pairs {rh: h, rt: t, k1: v1, . . . , kn: vn} for the n-ary hyper-relational fact (h, r, t). A simple n-ary fact example and its n-ary representation are shown in Fig. 31. Specifically, by this formal definition, a relation (binary or n-ary) is defined by the mappings from a role sequence corresponding to this type of relation to their values, and each specific mapping is an instance of this relation [291]. Each hyper-relational fact (h, r, t) with (ki, vi), i = 1, . . . , n is firstly associated with a meta-relation represented as an ordered list of keys (relations), such as R := (rh, rt, k1, . . . , kn); the fact is then represented as a list of ordered values associated with the above-mentioned meta-relation as: {R, (h, t, v1, . . . , vn)}. However, this form of hyper-relational fact pattern (as a set of key–value pairs without triples) treats each key–value pair in the fact equally, which is not compatible with the schema used by modern KGs [8]. To avoid the wastage of essential information in triples, Rosso et al. [8] decide to preserve the original triple schema of n-ary relational data, i.e., a fact contains a base triple (h, r, t) and a set of associated key–value pairs (ki, vi), i = 1, . . . , n, while a common triple fact only contains a triple (h, r, t). In other words, this definition emphasizes the non-negligible characteristic of the basic triple structure even in hyper-relational fact sets.

Fig. 31. An example of a hyper-relational fact and its corresponding n-ary representation [8].

5.3.2. Specific HKGC models

Based on the above-mentioned hyper-relational fact representations, we further discuss the HKGC models in detail.

m-TransH [291] is an earlier work that focuses on HKGs concerning n-ary relations (so-called multi-fold relations); it models the interaction between the entities involved in each fact for predicting missing links in KGs. On the basis of its hyper-relational fact definition and following the translation idea of TransH, m-TransH defines the score function of an instance by the weighted sum of the projection results from its values onto its relation hyperplane, in which the weights are the real numbers projected from its roles. However, the primary m-TransH does not take care of the relatedness of the components inside the same n-ary relational fact [292], so that this method does not make full use of the possible inner relative semantic information in the predefined fact structure. On the other hand, since m-TransH learns merely from sets of entities in meta-relations (taking no account of the exact relations in each meta-relation), it can be applied to conduct the link prediction (LP) task for predicting missing entities only.

RAE [112] further improves m-TransH by complementarily modeling the relatedness of values, which means the likelihood that two values co-participate in a common instance. The work [112] adds this relatedness loss, with a weighted hyper-parameter, to the embedding loss of m-TransH, and the relatedness metric is learned by RAE. When we return to the two issues m-TransH suffered from, we find that RAE attempts to solve the first ‘‘relatedness modeling’’ problem by taking the additional modeling of the relatedness of values into account. Although RAE surely achieves favorable performance which outperforms m-TransH, it does not consider the roles explicitly when evaluating the above likelihood [292], whereas roles are also a fundamental aspect of complex relation modeling, and taking them into consideration may make a difference because, under different sequences of roles (corresponding to different relations), the relatedness of two values tends to be greatly different. Taking an example from [292], Marie Curie and Henri Becquerel will be taken as more related under the role sequence (person, award, point in time, together with) than under the role sequence (person, spouse, start time, end time, place of marriage), because they won the Nobel Prize in Physics in 1903 together.

For the second problem, RAE learns from the pairwise relatedness between entities in each piece of n-ary relational data to perform instance reconstruction, i.e., predicting one or multiple missing entities [8]. Similar to m-TransH, RAE can only be used to perform LP.

NaLP [292] thereby designs a relatedness evaluation module to explicitly model the relatedness of the role–value (i.e., key–value or relation–entity) pairs involved in the same n-ary relational fact via a neural network pipeline, which supports the prediction of either a missing key (relation) or a missing value (entity). Until now, the two above-mentioned problems are all solved by [292]. In summary, m-TransH, RAE, and NaLP pay attention only to the set of key–value pairs of an n-ary fact, resulting in suboptimal models.
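To make the two representation schemes of Section 5.3.1 concrete, the following sketch encodes the same hyper-relational fact once as a flat set of key–value pairs (the n-ary view used by m-TransH, RAE and NaLP) and once as a base triple with qualifier pairs (the view advocated by HINGE [8], discussed next); the field names and the example fact are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict

# n-ary view: every role (including the head/tail roles) is a key-value pair.
nary_fact: Dict[str, str] = {
    "person": "Benedict Cumberbatch",
    "played_character": "Alan Turing",
    "in_film": "The Imitation Game",
}

# Triple-plus-qualifiers view: the base triple is kept distinct from the
# auxiliary key-value pairs attached to it.
@dataclass
class HyperRelationalFact:
    head: str
    relation: str
    tail: str
    qualifiers: Dict[str, str] = field(default_factory=dict)

hinge_style_fact = HyperRelationalFact(
    head="Benedict Cumberbatch",
    relation="played_character",
    tail="Alan Turing",
    qualifiers={"in_film": "The Imitation Game"},
)
print(nary_fact)
print(hinge_style_fact)
```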
Table 36
Statistic of popular hyper-relational datasets.
Dataset Entity Relation #Train (Binary / N-ary / Overall) #Valid (Binary / N-ary / Overall) #Test (Binary / N-ary / Overall)
JF17K 28,645 322 44,210 32,169 76,379 – – – 10,417 14,151 24,568
WikiPeople1 47,765 707 270,179 35,546 305,725 33,845 4378 38,223 33,890 4391 38,281
WikiPeople2 34 839 375 280 520 7389 287 918 – – – 36 597 971 37 586

HINGE [8] aims to directly learn from hyper-relational facts by not only distilling primary structure information from the triple data but also extracting further useful information from the corresponding key–value pairs simultaneously. HINGE also applies a neural network framework equipped with convolutional structures, just like the network of [292].

5.3.3. Negative sampling about hyper-relational data

A commonly adopted negative sampling process on HKGs is randomly corrupting one key or value in a true fact. For example, for an n-ary relational fact representation {rh: h, rt: t, ki: vi}, i = 1, . . . , n, when corrupting the key rh by a randomly sampled rh′ (with r′ ≠ r), the negative fact becomes {rh′: h, rt: t, ki: vi}, i = 1, . . . , n. However, this negative sampling process is not fully adaptable to the n-ary representation of hyper-relational facts; it is unrealistic in particular for the keys rh and rt, as rh′ is not compatible with rt while only one relation r (or r′) can be assumed between h and t in a hyper-relational fact [8]. Therefore, an improved negative sampling method is proposed to fix this issue in [8]. Specifically, when corrupting the key rh by a randomly sampled rh′ (with r′ ≠ r), the novel negative sampling approach also corrupts rt by rt′, resulting in a negative fact {rh′: h, rt′: t, ki: vi}, i = 1, . . . , n. Subsequently, for this negative fact, only a single relation r′ links h and t. Similarly, when corrupting rt, we also corrupt rh in the same way. This new process is more realistic than the original one.

5.3.4. Performance analysis of HKGC models

Datasets: As we have discussed, hyper-relational data is one natural fact style in KGs. For uniform modeling and learning, a KG is usually represented as a set of binary relational triples by decomposing n-ary relational facts into multiple triples relying on added virtual entities; for instance, a particular so-called ‘‘star-to-clique’’ (S2C) conversion procedure transforms non-binary relational data into binary triples on filtered Freebase data [291]. Since such procedures have been verified to be irreversible [291], they cause a loss of structural information in the multi-fold relations; in other words, this kind of transformed traditional triple dataset is no longer adaptable to n-ary relational fact learning. Therefore, special datasets for HKG embedding and completion are built as follows:

JF17K [291] is extracted from Freebase. After removing the entities involved in very few triples and the triples involving String, Enumeration Type, and Numbers, JF17K recovers a fact representation from the remaining triples. During fact recovering, it firstly removes facts from meta-relations which have only one single role. Then JF17K randomly selects 10,000 facts from each meta-relation containing more than 10,000 facts. According to two instance representation strategies, JF17K further constructs two instance representations Tid(F) and T(F), where F means the resulting fact representation from the previous steps. Next, the final dataset is built by further applying filtering on Tid(F) and T(F) into G, Gid, and randomly splitting along with the original instance representation operation s2c(G). These resulting datasets are uniformly called JF17K; we give their statistics in Table 36.

WikiPeople [292] is extracted from Wikidata and focuses on entities of type human without any specific filtering, so as to improve the presence of hyper-relational facts. The original WikiPeople dataset version in [292] also contains literals (used as tails) in some facts; Rosso et al. [8] further filter out these non-entity literals and the corresponding facts. Table 36 gives the main statistics of these two versions of the WikiPeople dataset. Each of these datasets contains both triple facts and hyper-relational facts.

Performance Comparison of HKGC Models: To get an understanding of the HKGC performance of existing models, we refer to the newest public KGC results for learning from hyper-relational facts in [8] (shown in Table 37). We observe that HINGE [8] consistently outperforms all other models when learning hyper-relational facts, and even performs better than the best-performing baseline NaLP-Fix [292], with a 13.2% improvement on the link prediction (LP) task and a 15.1% improvement on the relation prediction (RP) task on WikiPeople (84.1% and 23.8% on JF17K, respectively). Also, from Table 37 we can see that NaLP shows better performance than m-TransH and RAE, since it learns the relatedness between relation–entity pairs while m-TransH and RAE learn from entities only.

Moreover, Rosso et al. [8] noted that m-TransH and RAE result in very low performance on WikiPeople, which is probably due to the weak presence of hyper-relational facts in WikiPeople, while m-TransH and RAE are specifically designed for hyper-relational facts. Besides, it is obvious that NaLP-Fix (with a fixed negative sampling process) consistently shows better performance compared to NaLP, with a slight improvement of 2.8% in head/tail prediction and a tremendous improvement of 69.9% in RP on WikiPeople (10.4% and 15.8% on JF17K, respectively). This result verifies the effectiveness of the fixed negative sampling process proposed in [8], in particular for RP.

In addition, the baseline methods learning from hyper-relational facts (i.e., m-TransH, RAE, NaLP and NaLP-Fix) surprisingly yield worse performance in many cases than the best-performing baseline which learns from triples only [8]. They further explain that the ignorance of the triple structure results in this subpar performance, because the triple structure in KGs preserves essential information for KGC.

6. Discussion and outlook

6.1. Discussion about KGC studies

According to a series of systematic studies of recent KGC works, we discuss several major insights as follows:

1. About Traditional KGC Models: As KGC technology matures, the traditional translation models, decomposition models and neural network models in this field tend to become more and more commonly used as baseline KGC tools to be integrated with other technologies for efficient and effective KGC research.

2. About the Optimization Problem: It is absolutely necessary to pay attention to the optimization method. A proper optimization method can make it faster or more accurate to obtain a solution. The modeling of the optimization objective also determines whether the KGC problem has a global or only a local optimal solution, or in some cases, it can alleviate the tendency to fall into a local optimal solution (suboptimal solution), which is not conducive to the KGC task.
Table 37
The performance of several HKGC methods on WikiPeople and JF17K [8].
Method WikiPeople JF17K
Head/Tail prediction Relation prediction Head/Tail prediction Relation prediction
MRR Hit@10 Hit@1 MRR Hit@10 Hit@1 MRR Hit@10 Hit@1 MRR Hit@10 Hit@1
m-TransH 0.0633 0.3006 0.0633 N/A 0.206 0.4627 0.206 N/A
RAE 0.0586 0.3064 0.0586 N/A 0.2153 0.4668 0.2153 N/A
NaLP 0.4084 0.5461 0.3311 0.4818 0.8516 0.3198 0.2209 0.331 0.165 0.6391 0.8215 0.5472
NaLP-Fix 0.4202 0.5564 0.3429 0.82 0.9757 0.7197 0.2446 0.3585 0.1852 0.7469 0.8921 0.6665
HINGE 0.4763 0.5846 0.4154 0.95 0.9977 0.9159 0.4489 0.6236 0.3611 0.9367 0.9894 0.9014
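The fixed negative sampling procedure of Section 5.3.3, which underlies the NaLP-Fix results in Table 37, can be sketched roughly as follows; the data layout and helper names are illustrative assumptions rather than the exact implementation of [8].

```python
import random

def fixed_negative_sample(nary_fact, relations, rng=random):
    """Given an n-ary fact {'r_h': (relation, head), 'r_t': (relation, tail),
    plus qualifier key-value pairs}, corrupt the relation behind r_h AND r_t
    together, so that exactly one relation still links head and tail
    (the 'fixed' scheme of Section 5.3.3)."""
    r, head = nary_fact["r_h"]
    _, tail = nary_fact["r_t"]
    r_prime = rng.choice([x for x in relations if x != r])
    negative = dict(nary_fact)
    negative["r_h"] = (r_prime, head)
    negative["r_t"] = (r_prime, tail)
    return negative

fact = {"r_h": ("educated_at", "Marie Curie"),
        "r_t": ("educated_at", "University of Paris"),
        "academic_degree": "PhD"}
print(fixed_negative_sample(fact, ["educated_at", "award_received", "spouse"]))
```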

3. About Regularization and Constraints: During the learning of a specific model, proper regularization and constraints, as well as skillful hyper-parameter tuning, can make the trained model achieve unexpectedly good results. Although this is an empirical step, possibly with potential drawbacks (for example, N3 regularization [50] requires larger embedding dimensions, and some optimization techniques (e.g., Tucker [55]) may require a large number of parameters, so the resulting scalability or economy issues need to be considered), we should attach importance to model tuning. Relevant attention has been raised in previous works [50], openly asking whether poorly adjusted parameters or the model itself should be held responsible for bad performance; this needs to be studied and tested continuously, emphasizing that model tuning is as important as the optimization model itself.

4. About Joint Learning Related to KGC: We conclude that the joint KGC models that jointly learn distinct components tend to develop their energy function in a composite form. The joint KGC methods usually extend the original definition of triple energy (distance energy, similarity energy, etc.) to consider the new multimodal representations.

5. About Information Fusion Strategies: We also summarize several common experiences here. One of them is that when it comes to the internal combination of the same kind of information (such as collecting useful surrounding graph context as effectively as possible for learning a proper neighbor-aware representation, or combining different paths between an entity pair), an attention mechanism along with various neural network structures is an appropriate fusion strategy in most cases. Moreover, drawing lessons from the NLP field, the RNN structure is suitable for dealing with sequence problems. For example, when considering path modeling, the generally applied neural network structure is the RNN [96,163,166–168], and the same holds when utilizing textual information (especially long text sequences) for KGC.

6. Embedding-based Reasoning and Rule-based Reasoning: As we have introduced and analyzed in our work, both rule-based reasoning and embedding-based reasoning have their separate advantages and disadvantages. In this case, researchers tend to make these two kinds of KGC models cooperate, expecting to exert both of their superiorities sufficiently.

6.2. Outlook on KGC

We give the following outlooks based on our observation and overview in this paper:

1. A Deep-level Interaction is Beneficial for KGC. In the aspect of adding additional information for KGC, especially the extra information outside KGs, such as the rules and external text resources we mentioned, an emerging research trend is exploring deep-level interactive learning between external knowledge and internal knowledge. That is, designing a model jointly with a combination of parameter sharing and information circulation, even employing an iterative learning manner, to achieve the goal of enriching the knowledge of the internal KG with external information, which in turn feeds back the training information to the encoding-side module based on both the external information and the internal KG's data while continuously replenishing the ‘‘knowledge’’ of the KG.

2. Rule-based KGC is Promising. As introduced in our paper, rule-based approaches perform very well and are a competitive alternative to popular embedding models. For that reason, they deserve to be included as a baseline for the evaluation of KGC methods, and it has been recommended that conducting the evaluation at a more fine-grained level is necessary and instructive for further study of the KGC field in the future.

3. Trying New PLMs is Feasible. Obviously, the endless stream of new pre-trained language models (PLMs) provides unlimited possibilities to combine effective language models with various text information for obtaining high-quality embeddings and capturing abundant semantic information to complete KGs.

4. There is Plenty of Scope for Specific-Field KGC. The emergence of new KGs in various specific fields stimulates completion research on specific-field KGs. Although the existing KGC works concerning the KGs of specific fields and demands are still relatively rare (for example, there are only a few or a dozen papers studying the completion of CommonSense KGs and Hyper-Relational KGs), KGC for specific-field KGs is truly meaningful and of great practical application value, and it will be further developed in the future.

5. Capturing the Interaction between Distinct KGs will be Helpful to KGC. A series of tasks have emerged with the need for interaction between various KGs, such as entity alignment, entity disambiguation, attribute alignment and so on. When it comes to multi-source knowledge fusion, the research on heterogeneous graph embedding (HGE) and multilingual Knowledge Graph Embedding (MKGE) has gradually attracted much attention, which is not covered in our current review. KGC under multi-KG interaction could evolve into a sub-direction for the future development of KGC, which may create some inspiring ideas by studying the unified embedding and completion of knowledge of different types and structures. By the way, the KGC work with respect to multilingual KGs is insufficient; it is worth launching this research direction to replenish the multilingual KGs demanded in real applications.

6. Select a More Proper Modeling Space. A novel opinion indicates that the modeling space of KG embedding does not have to be limited to Euclidean space as most literature does (TransE and its extensions); on the contrary, as KGs possess an intrinsic characteristic of presenting power-law (or scale-free) degree distributions as many other networks do [293,294], it has been shown that scale-free networks naturally emerge in hyperbolic space [295]. Recently, hyperbolic geometry was exploited in various works [296–298] as a means to provide high-quality embeddings for hierarchical structures instead of in ordinary Euclidean space. The work in [295] illustrated that hyperbolic space has the potential to play a significant role in the task of KGC since it offers a natural way to take the KG's topological information into account. This situation inspires researchers to explore more
effective and reasonable embedding vector spaces for KGC in which to implement the basic translation transformation or tensor decomposition of entities and relations; the expected model space should be able to easily model complex types of entities and relations, along with various structural information.

7. Explore the Usage of RL in KGC. Reinforcement learning (RL) has seen a variety of applications in NLP, including machine translation [299], summarization [300], and semantic parsing [301]. Compared to other applications, RL formulations in NLP and KGs tend to have a large action space (e.g., in machine translation and KGC, the space of possible actions is the entire vocabulary of a language and the whole set of neighbors of an entity, respectively) [302]. On this basis, more recent work formulates multi-hop reasoning as a sequential decision problem and exploits reinforcement learning (RL) to perform effective path search [63,141,303,304]. Under normal circumstances, an RL agent is designed to find reasoning paths in the KG, which can control the properties of the found paths rather than using random walks as previous path-finding models did. These effective paths not only can be used as an alternative to the Path Ranking Algorithm (PRA) in many path-based reasoning methods, but can also be treated as reasoning formulas [303]. In particular, some recent studies apply human-defined reward functions; a foreseeable future direction is to investigate the possibility of incorporating other strategies (such as adversarial learning [27]) to give better rewards than human-defined reward functions. On the other hand, a discriminative model can be trained to give rewards instead of designing rewards according to path characteristics. Additionally, in the future, RL frameworks can be developed to jointly reason with KG triples and text mentions, which can help to address the problematic scenario in which the KG does not have enough reasoning paths.

8. Multi-task Learning about KGC. Multi-task learning (MTL) [305] is attracting growing attention, based on the insight that the combined learning of multiple related tasks can outperform learning each task in isolation. With the idea of MTL, KGC can be learned and trained with other KG-based tasks (or properly designed ancillary tasks) in an MTL framework, which could gain both representability and generalization by sharing common information between the tasks during the learning process, so as to achieve better overall performance.

7. Conclusion

With this overview, we tried to fill a research gap by giving a systematic and comprehensive introduction of Knowledge Graph Completion (KGC) works and to shed new light on the insights gained in previous years. We provide a novel full-view categorization, comparison, and analysis of research on KGC studies. Specifically, at the high level, we review KGC in three major aspects: KGC merely with internal structural information, KGC with additional information, and other special KGC studies. For the first category, KGC is reviewed under Tensor/matrix factorization models, Translation models, and Neural Network models. For the second category, we further propose fine-grained taxonomies from two views concerning the usage of inside information of KGs (including node attributes, entity-related information, relation-related information, neighbor information, and relational path information) or outside information of KGs (including rule-based KGC and third-party data sources-based KGC). The third part pays attention to other special KGC, such as CommonSense KGC, Temporal KGC, and Hyper-relational KGC. In particular, our survey provides a detailed and in-depth comparison and analysis of each KGC category at the fine-grained level and finally gives a global discussion and prospect for the future research directions of KGC. This paper may help researchers grasp the main ideas and results of KGC, and highlight ongoing research on them. In the future, we will design a relatively uniform evaluation framework and conduct more detailed experimental evaluations.

CRediT authorship contribution statement

Tong Shen: Classification, Comparisons and analyses, Performance evaluation, Writing – original draft, Revision responses. Fu Zhang: Classification, Writing – review & editing, Revision responses. Jingwei Cheng: Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors sincerely thank the editors and the anonymous reviewers for their valuable comments and suggestions, which improved the paper. The work is supported by the National Natural Science Foundation of China (61672139) and the Fundamental Research Funds for the Central Universities, China (No. N2216008).

References

[1] Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard de Melo, Claudio Gutierrez, José Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier, Axel Polleres, et al., Knowledge graphs, 2020, arXiv preprint arXiv:2003.02320.
[2] Wanli Li, Tieyun Qian, Ming Zhong, Xu Chen, Interactive lexical and semantic graphs for semisupervised relation extraction, IEEE Trans. Neural Netw. Learn. Syst. (2022).
[3] Ming Zhong, Yingyi Zheng, Guotong Xue, Mengchi Liu, Reliable keyword query interpretation on summary graphs, IEEE Trans. Knowl. Data Eng. (2022).
[4] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al., Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia, Semant. Web 6 (2) (2015) 167–195.
[5] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, Jamie Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, in: ACM SIGMOD International Conference on Management of Data, 2008, pp. 1247–1250.
[6] George A. Miller, WordNet: a lexical database for English, Commun. ACM 38 (11) (1995) 39–41.
[7] Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum, Yago: a core of semantic knowledge, in: WWW, 2007, pp. 697–706.
[8] Paolo Rosso, Dingqi Yang, Philippe Cudré-Mauroux, Beyond triplets: hyper-relational knowledge graph embedding for link prediction, in: The Web Conference, 2020, pp. 1885–1896.
[9] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, Wei Zhang, Knowledge vault: A web-scale approach to probabilistic knowledge fusion, in: ACM SIGKDD KDD, 2014, pp. 601–610.
[10] Antoine Bordes, Xavier Glorot, Jason Weston, Yoshua Bengio, A semantic matching energy function for learning with multi-relational data, Mach. Learn. 94 (2) (2014) 233–259.
[11] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko, Translating embeddings for modeling multi-relational data, in: NIPS, 2013, pp. 1–9.
[12] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, Xuan Zhu, Learning entity and relation embeddings for knowledge graph completion, in: AAAI, Vol. 29, 2015.
[13] Maximilian Nickel, Volker Tresp, Hans-Peter Kriegel, A three-way model for collective learning on multi-relational data, in: ICML, 2011.
[14] Richard Socher, Danqi Chen, Christopher D. Manning, Andrew Ng, Reasoning with neural tensor networks for knowledge base completion, in: NIPS, Citeseer, 2013, pp. 926–934.
[15] Zhen Wang, Jianwen Zhang, Jianlin Feng, Zheng Chen, Knowledge graph embedding by translating on hyperplanes, in: AAAI, Vol. 28, 2014.
[16] Quan Wang, Zhendong Mao, Bin Wang, Li Guo, Knowledge graph embedding: A survey of approaches and applications, TKDE 29 (12) (2017) 2724–2743.
[17] Genet Asefa Gesese, Russa Biswas, Mehwish Alam, Harald Sack, A survey on knowledge graph embeddings with literals: Which model links better literal-ly?, 2019.
[18] Andrea Rossi, Denilson Barbosa, Donatella Firmani, Antonio Matinata, [47] Rudolf Kadlec, Ondrej Bajgar, Jan Kleindienst, Knowledge base com-
Paolo Merialdo, Knowledge graph embedding for link prediction: A pletion: Baselines strike back, 2017, arXiv preprint arXiv:1705.
comparative analysis, TKDD 15 (2) (2021) 1–49. 10744.
[19] Mayank Kejriwal, Advanced Topic: Knowledge Graph Completion, 2019. [48] Hitoshi Manabe, Katsuhiko Hayashi, Masashi Shimbo, Data-dependent
[20] Dat Quoc Nguyen, An overview of embedding models of entities and learning of symmetric/antisymmetric relations for knowledge base
relationships for knowledge base completion, 2017. completion, in: AAAI, Vol. 32, 2018.
[21] Hongyun Cai, Vincent W. Zheng, Kevin Chen-Chuan Chang, A comprehen- [49] Boyang Ding, Quan Wang, Bin Wang, Li Guo, Improving knowledge
sive survey of graph embedding: Problems, techniques, and applications, graph embedding using simple constraints, 2018, arXiv preprint arXiv:
TKDE 30 (9) (2018) 1616–1637. 1805.02408.
[22] Palash Goyal, Emilio Ferrara, Graph embedding techniques, applications, [50] Timothée Lacroix, Nicolas Usunier, Guillaume Obozinski, Canonical tensor
and performance: A survey, Knowl.-Based Syst. 151 (2018) 78–94. decomposition for knowledge base completion, in: ICML, PMLR, 2018, pp.
[23] Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, Philip S. 2863–2872.
Yu, A survey on knowledge graphs: Representation, acquisition and [51] Koki Kishimoto, Katsuhiko Hayashi, Genki Akai, Masashi Shimbo, Bina-
applications, 2020, arXiv preprint arXiv:2002.00388. rized canonical polyadic decomposition for knowledge graph completion,
[24] Heiko Paulheim, Knowledge graph refinement: A survey of approaches 2019, arXiv preprint arXiv:1912.02686.
and evaluation methods, Semant. Web 8 (3) (2017) 489–508. [52] Shuai Zhang, Yi Tay, Lina Yao, Qi Liu, Quaternion knowledge graph
[25] Baoxu Shi, Tim Weninger, Open-world knowledge graph completion, in: embeddings, 2019, arXiv preprint arXiv:1904.10281.
AAAI, Vol. 32, 2018. [53] Esma Balkir, Masha Naslidnyk, Dave Palfrey, Arpit Mittal, Using pairwise
[26] Agustín Borrego, Daniel Ayala, Inma Hernández, Carlos R. Rivero, David occurrence information to improve knowledge graph completion on
Ruiz, Generating rules to filter candidate triples for their correctness large-scale datasets, 2019, arXiv preprint arXiv:1910.11583.
checking by knowledge graph completion techniques, in: KCAP, 2019, [54] Ankur Padia, Konstantinos Kalpakis, Francis Ferraro, Tim Finin, Knowledge
pp. 115–122. graph fact prediction via knowledge-enriched tensor factorization, J. Web
[27] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Semant. 59 (2019) 100497.
Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative [55] Ledyard R. Tucker, Some mathematical notes on three-mode factor
adversarial networks, 2014, arXiv preprint arXiv:1406.2661. analysis, Psychometrika 31 (3) (1966) 279–311.
[28] Liwei Cai, William Yang Wang, Kbgan: Adversarial learning for knowledge [56] Tamara G. Kolda, Brett W. Bader, Tensor decompositions and applications,
graph embeddings, 2017, arXiv preprint arXiv:1711.04071. SIAM Rev. 51 (3) (2009) 455–500.
[29] Kairong Hu, Hai Liu, Tianyong Hao, A knowledge selective adversarial [57] Richard A. Harshman, Models for analysis of asymmetrical relationships
network for link prediction in knowledge graph, in: CCF NLPCC, Springer, among N objects or stimuli, in: First Joint Meeting of the Psychometric
2019, pp. 171–183. Society and the Society of Mathematical Psychology, Hamilton, Ontario,
[30] Jinghao Niu, Zhengya Sun, Wensheng Zhang, Enhancing knowledge graph 1978, 1978.
completion with positive unlabeled learning, in: ICPR, IEEE, 2018, pp. [58] Maximilian Nickel, Lorenzo Rosasco, Tomaso Poggio, Holographic
296–301. embeddings of knowledge graphs, in: AAAI, Vol. 30, 2016.
[31] Yanjie Wang, Rainer Gemulla, Hui Li, On multi-relational link prediction [59] Richard A. Harshman, Margaret E. Lundy, PARAFAC: Parallel factor
with bilinear models, in: AAAI, Vol. 32, 2018. analysis, Comput. Statist. Data Anal. 18 (1) (1994) 39–72.
[32] Kristina Toutanova, Danqi Chen, Observed versus latent features for [60] Daniel D. Lee, H. Sebastian Seung, Learning the parts of objects by
knowledge base and text inference, in: Proceedings of the 3rd Workshop non-negative matrix factorization, Nature 401 (6755) (1999) 788–791.
on Continuous Vector Space Models and their Compositionality, 2015, pp. [61] Ruslan Salakhutdinov, Nathan Srebro, Collaborative filtering in a non-
57–66. uniform world: Learning with the weighted trace norm, 2010, arXiv
[33] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, preprint arXiv:1002.2780.
Convolutional 2d knowledge graph embeddings, in: AAAI, Vol. 32, 2018. [62] Shmuel Friedland, Lek-Heng Lim, Nuclear norm of higher-order tensors,
[34] Ke Tu, Peng Cui, Daixin Wang, Zhiqiang Zhang, Jun Zhou, Yuan Qi, Wenwu Math. Comp. 87 (311) (2018) 1255–1281.
Zhu, Conditional graph attention networks for distilling and refining [63] Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan
knowledge graphs in recommendation, in: Proceedings of the 30th Durugkar, Akshay Krishnamurthy, Alex Smola, Andrew McCallum, Go for
ACM International Conference on Information & Knowledge Management, a walk and arrive at the answer: Reasoning over paths in knowledge
2021, pp. 1834–1843. bases using reinforcement learning, 2017, arXiv preprint arXiv:1711.
[35] Baoxu Shi, Tim Weninger, Discriminative predicate path mining for fact 05851.
checking in knowledge graphs, Knowl.-Based Syst. 104 (2016) 123–133. [64] Quan Liu, Hui Jiang, Andrew Evdokimov, Zhen-Hua Ling, Xiaodan Zhu, Si
[36] Shengbin Jia, Yang Xiang, Xiaojun Chen, Kun Wang, Triple trustworthiness Wei, Yu Hu, Probabilistic reasoning via deep learning: Neural association
measurement for knowledge graph, in: The World Wide Web Conference, models, 2016, arXiv preprint arXiv:1603.07704.
2019, pp. 2865–2871. [65] Saiping Guan, Xiaolong Jin, Yuanzhuo Wang, Xueqi Cheng, Shared embed-
[37] Ivana Balažević, Carl Allen, Timothy M. Hospedales, Tucker: Tensor ding based neural networks for knowledge graph completion, in: ACM
factorization for knowledge graph completion, 2019, arXiv preprint arXiv: CIKM, 2018, pp. 247–256.
1901.09590. [66] Feihu Che, Dawei Zhang, Jianhua Tao, Mingyue Niu, Bocheng Zhao,
[38] Rodolphe Jenatton, Nicolas Le Roux, Antoine Bordes, Guillaume Obozinski, Parame: Regarding neural network parameters as relation embeddings
A latent factor model for highly multi-relational data, in: NIPS), 2012, pp. for knowledge graph completion, in: Proceedings of the AAAI Conference
3176–3184. on Artificial Intelligence, Vol. 34, 2020, pp. 2774–2781.
[39] Alberto Garcia-Duran, Antoine Bordes, Nicolas Usunier, Yves Grandvalet, [67] Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, Nilesh Agrawal, Partha
Combining two and three-way embeddings models for link prediction in Talukdar, Interacte: Improving convolution-based knowledge graph em-
knowledge bases, 2015, arXiv preprint arXiv:1506.00999. beddings by increasing feature interactions, in: AAAI, Vol. 34, 2020, pp.
[40] Hanxiao Liu, Yuexin Wu, Yiming Yang, Analogical inference for 3009–3016.
multi-relational embeddings, in: ICML, PMLR, 2017, pp. 2168–2178. [68] Tu Dinh Nguyen Dai Quoc Nguyen, Dat Quoc Nguyen, Dinh Phung, A novel
[41] Yi Tay, Anh Tuan Luu, Siu Cheung Hui, Falk Brauer, Random semantic embedding model for knowledge base completion based on convolutional
tensor ensemble for scalable knowledge graph link prediction, in: WSDM, neural network, in: NAACL-HLT, 2018, pp. 327–333.
2017, pp. 751–760. [69] Dai Quoc Nguyen, Thanh Vu, Tu Dinh Nguyen, Dat Quoc Nguyen, Dinh
[42] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, Li Deng, Embed- Phung, A capsule network-based embedding model for knowledge graph
ding entities and relations for learning and inference in knowledge bases, completion and search personalization, 2018, arXiv preprint arXiv:1808.
2014, arXiv preprint arXiv:1412.6575. 04122.
[43] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, Guillaume [70] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne Van
Bouchard, Complex embeddings for simple link prediction, in: ICML, Den Berg, Ivan Titov, Max Welling, Modeling relational data with graph
PMLR, 2016, pp. 2071–2080. convolutional networks, in: ESWC, Springer, 2018, pp. 593–607.
[44] Seyed Mehran Kazemi, David Poole, Simple embedding for link prediction [71] Chao Shang, Yun Tang, Jing Huang, Jinbo Bi, Xiaodong He, Bowen Zhou,
in knowledge graphs, 2018, arXiv preprint arXiv:1802.04868. End-to-end structure-aware convolutional networks for knowledge base
[45] ABM Moniruzzaman, Richi Nayak, Maolin Tang, Thirunavukarasu Bal- completion, in: AAAI, Vol. 33, 2019, pp. 3060–3067.
asubramaniam, Fine-grained type inference in knowledge graphs via [72] Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, Partha Talukdar,
probabilistic and tensor factorization methods, in: WWW, 2019, pp. Composition-based multi-relational graph convolutional networks, in:
3093–3100. International Conference on Learning Representations, 2019.
[46] Sameh K. Mohamed, Nová Vít, TriVec: Knowledge Graph Embeddings [73] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Janvin, A
for Accurate and Efficient Link Prediction in Real World Application neural probabilistic language model, J. Mach. Learn. Res. 3 (2003)
Scenarios. 1137–1155.

60
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

[74] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray [103] Qizhe Xie, Xuezhe Ma, Zihang Dai, Eduard H. Hovy, An interpretable
Kavukcuoglu, Pavel Kuksa, Natural language processing (almost) from knowledge transfer model for knowledge base completion, in: ACL (1),
scratch, J. Mach. Learn. Res. 12 (ARTICLE) (2011) 2493–2537. 2017.
[75] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Efficient estimation [104] Wei Qian, Cong Fu, Yu Zhu, Deng Cai, Xiaofei He, Translating embeddings
of word representations in vector space, 2013, arXiv preprint arXiv: for knowledge graph completion with relation attention mechanism, in:
1301.3781. IJCAI, 2018, pp. 4286–4292.
[76] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, Imagenet clas- [105] Jun Yuan, Neng Gao, Ji Xiang, TransGate: knowledge graph embedding
sification with deep convolutional neural networks, NIPS 25 (2012) with shared gate structure, in: AAAI, Vol. 33, 2019, pp. 3100–3107.
1097–1105. [106] Xiaofei Zhou, Qiannan Zhu, Ping Liu, Li Guo, Learning knowledge embed-
[77] Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton, Dynamic routing dings by combining limit-based scoring loss, in: ACM on CIKM, 2017, pp.
between capsules, 2017, arXiv preprint arXiv:1710.09829. 1009–1018.
[78] Joan Bruna, Wojciech Zaremba, Arthur Szlam, Yann LeCun, Spectral [107] Mojtaba Nayyeri, Sahar Vahdati, Jens Lehmann, Hamed Shariat Yazdi, Soft
networks and locally connected networks on graphs, 2014. marginal transe for scholarly knowledge graph completion, 2019, arXiv
[79] Na Li, Zied Bouraoui, Steven Schockaert, Ontology completion using graph preprint arXiv:1904.12211.
convolutional networks, in: ISWC, Springer, 2019, pp. 435–452. [108] Han Xiao, Minlie Huang, Yu Hao, Xiaoyan Zhu, TransA: An adaptive
[80] David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael approach for knowledge graph embedding, 2015, arXiv preprint arXiv:
Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, Ryan P. Adams, 1509.05490.
Convolutional networks on graphs for learning molecular fingerprints, [109] Takuma Ebisu, Ryutaro Ichise, Toruse: Knowledge graph embedding on a
2015, arXiv preprint arXiv:1509.09292. lie group, in: AAAI, Vol. 32, 2018.
[81] Aditya Grover, Aaron Zweig, Stefano Ermon, Graphite: Iterative generative [110] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, Jian Tang, RotatE: Knowledge
modeling of graphs, in: ICML, PMLR, 2019, pp. 2434–2444. graph embedding by relational rotation in complex space, in: ICLR, 2018.
[82] Thomas N. Kipf, Max Welling, Semi-supervised classification with graph [111] Ruobing Xie, Zhiyuan Liu, Fen Lin, Leyu Lin, Does william shakespeare
convolutional networks, 2016, arXiv preprint arXiv:1609.02907. really write hamlet? knowledge representation learning with confidence,
[83] Xiangyu Song, Jianxin Li, Qi Lei, Wei Zhao, Yunliang Chen, Ajmal in: AAAI, Vol. 32, 2018.
Mian, Bi-CLKT: Bi-graph contrastive learning based knowledge tracing, [112] Richong Zhang, Junpeng Li, Jiajie Mei, Yongyi Mao, Scalable instance
Knowl.-Based Syst. 241 (2022) 108274. reconstruction in knowledge bases via relatedness affiliated embedding,
[84] Xiangyu Song, Jianxin Li, Yifu Tang, Taige Zhao, Yunliang Chen, Ziyu Guan, in: WWW, 2018, pp. 1185–1194.
Jkt: A joint graph convolutional network based deep knowledge tracing, [113] Antoine Bordes, Jason Weston, Ronan Collobert, Yoshua Bengio, Learning
Inform. Sci. 580 (2021) 510–523. structured embeddings of knowledge bases, in: AAAI, Vol. 25, 2011.
[85] Albert T. Corbett, John R. Anderson, Knowledge tracing: Modeling the
[114] Ziqi Zhang, Effective and efficient semantic table interpretation using
acquisition of procedural knowledge, User Model. User-Adapt. Interact. 4
tableminer+, Semant. Web 8 (6) (2017) 921–957.
(4) (1994) 253–278.
[115] Yun Tang, Jing Huang, Guangtao Wang, Xiaodong He, Bowen Zhou, Or-
[86] Yaming Yang, Ziyu Guan, Jianxin Li, Wei Zhao, Jiangtao Cui, Quan Wang,
thogonal relation transforms with graph context modeling for knowledge
Interpretable and efficient heterogeneous graph convolutional network,
graph embedding, 2019, arXiv preprint arXiv:1911.04910.
IEEE Trans. Knowl. Data Eng. (2021).
[116] Nitin Bansal, Xiaohan Chen, Zhangyang Wang, Can we gain more from
[87] Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, Tianyi Wu, Pathsim:
orthogonality regularizations in training deep networks? Adv. Neural Inf.
Meta path-based top-k similarity search in heterogeneous information
Process. Syst. 31 (2018) 4261–4271.
networks, Proc. VLDB Endow. 4 (11) (2011) 992–1003.
[117] Shengwu Xiong, Weitao Huang, Pengfei Duan, Knowledge graph embed-
[88] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
ding via relation paths and dynamic mapping matrix, in: International
Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative
Conference on Conceptual Modeling, Springer, 2018, pp. 106–118.
adversarial networks, 2014, arXiv preprint arXiv:1406.2661.
[118] Alberto Garcia-Duran, Mathias Niepert, Kblrn: End-to-end learning of
[89] Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, Seqgan: Sequence
knowledge base representations with latent, relational, and numerical
generative adversarial nets with policy gradient, in: AAAI, Vol. 31, 2017.
features, 2017, arXiv preprint arXiv:1709.04676.
[90] Zhiqing Sun, Shikhar Vashishth, Soumya Sanyal, Partha Talukdar, Yiming
[119] T. Yi, L.A. Tuan, M.C. Phan, S.C. Hui, Multi-task neural network for
Yang, A re-evaluation of knowledge graph completion methods, 2019,
non-discrete attribute prediction in knowledge graphs, in: CIKM’17, 2017.
arXiv preprint arXiv:1911.03903.
[120] Yanrong Wu, Zhichun Wang, Knowledge graph embedding with numeric
[91] Deepak Nathani, Jatin Chauhan, Charu Sharma, Manohar Kaul, Learning
attributes of entities, in: Workshop on RepL4NLP, 2018, pp. 132–136.
attention-based embeddings for relation prediction in knowledge graphs,
in: ACL, 2019, pp. 4710–4723. [121] Huaping Zhong, Jianwen Zhang, Zhen Wang, Hai Wan, Zheng Chen,
[92] Jeffrey Pennington, Richard Socher, Christopher D. Manning, Glove: Global Aligning knowledge and text embeddings by entity descriptions, in:
vectors for word representation, in: EMNLP, 2014, pp. 1532–1543. EMNLP, 2015, pp. 267–272.
[93] Byungkook Oh, Seungmin Seo, Kyong-Ho Lee, Knowledge graph [122] Ruobing Xie, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Image-embodied
completion by context-aware convolutional learning with multi-hop knowledge representation learning, 2016, arXiv preprint arXiv:1609.
neighborhoods, in: ACM CIKM, 2018, pp. 257–266. 07028.
[94] Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, Maosong Sun, Represen- [123] Hatem Mousselly-Sergieh, Teresa Botschen, Iryna Gurevych, Stefan
tation learning of knowledge graphs with entity descriptions, in: AAAI, Roth, A multimodal translation-based approach for knowledge graph
Vol. 30, 2016. representation learning, in: SEM, 2018, pp. 225–234.
[95] Minjun Zhao, Yawei Zhao, Bing Xu, Knowledge graph completion via [124] Pouya Pezeshkpour, Liyan Chen, Sameer Singh, Embedding multimodal
complete attention between knowledge graph and entity descriptions, relational data for knowledge base completion, in: EMNLP, 2018.
in: CSAE, 2019, pp. 1–6. [125] Ye Liu, Hui Li, Alberto Garcia-Duran, Mathias Niepert, Daniel Onoro-Rubio,
[96] Tehseen Zia, Usman Zahid, David Windridge, A generative adversarial David S. Rosenblum, MMKG: multi-modal knowledge graphs, in: ESWC,
strategy for modeling relation paths in knowledge base representation Springer, 2019, pp. 459–474.
learning, 2019. [126] Agustinus Kristiadi, Mohammad Asif Khan, Denis Lukovnikov, Jens
[97] Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, Jun Zhao, Knowledge graph Lehmann, Asja Fischer, Incorporating literals into knowledge graph
embedding via dynamic mapping matrix, in: ACL and IJCNLP (Volume 1: embeddings, in: ISWC, Springer, 2019, pp. 347–363.
Long Papers), 2015, pp. 687–696. [127] Kai-Wei Chang, Wen-tau Yih, Bishan Yang, Christopher Meek, Typed
[98] Hee-Geun Yoon, Hyun-Je Song, Seong-Bae Park, Se-Young Park, A tensor decomposition of knowledge bases for relation extraction, in:
translation-based knowledge graph embedding preserving logical prop- EMNLP, 2014, pp. 1568–1579.
erty of relations, in: NAACL: Human Language Technologies, 2016, pp. [128] Denis Krompaß, Stephan Baier, Volker Tresp, Type-constrained repre-
907–916. sentation learning in knowledge graphs, in: ISWC, Springer, 2015, pp.
[99] Kien Do, Truyen Tran, Svetha Venkatesh, Knowledge graph embedding 640–655.
with multiple relation projections, in: ICPR, IEEE, 2018, pp. 332–337. [129] Shiheng Ma, Jianhui Ding, Weijia Jia, Kun Wang, Minyi Guo, Transt:
[100] Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, Mark Johnson, STransE: a novel Type-based multiple embedding representations for knowledge graph
embedding model of entities and relationships in knowledge bases, in: completion, in: ECML PKDD, Springer, 2017, pp. 717–733.
HLT-NAACL, 2016. [130] Alexandros Komninos, Suresh Manandhar, Feature-rich networks for
[101] Jun Feng, Minlie Huang, Mingdong Wang, Mantong Zhou, Yu Hao, Xiaoyan knowledge base completion, in: ACL (Volume 2: Short Papers), 2017, pp.
Zhu, Knowledge graph embedding by flexible translation, in: KR, 2016, 324–329.
pp. 557–560. [131] Elvira Amador-Domínguez, Patrick Hohenecker, Thomas Lukasiewicz,
[102] Miao Fan, Qiang Zhou, Emily Chang, Fang Zheng, Transition-based knowl- Daniel Manrique, Emilio Serrano, An ontology-based deep learning ap-
edge graph embedding with relational mapping properties, in: PACLIC, proach for knowledge graph completion with fresh entities, in: DCAI,
2014, pp. 328–337. Springer, 2019, pp. 125–133.

61
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

[132] Zhiting Hu, Poyao Huang, Yuntian Deng, Yingkai Gao, Eric Xing, Entity [160] T. Denoeux, A k-nearest neighbor classification rule based on
hierarchy embedding, in: ACL and IJCNLP (Volume 1: Long Papers), 2015, Dempster-Shafer theory, IEEE Trans. Syst. Man Cybern. 25 (5) (1995)
pp. 1292–1300. 804–813.
[133] Shu Guo, Quan Wang, Bin Wang, Lihong Wang, Li Guo, Semantically [161] Antoine Bordes, Xavier Glorot, Jason Weston, Yoshua Bengio, Joint learn-
smooth knowledge graph embedding, in: ACL and IJCNLP (Volume 1: Long ing of words and meaning representations for open-text semantic parsing,
Papers), 2015, pp. 84–94. in: Artificial Intelligence and Statistics, PMLR, 2012, pp. 127–135.
[134] Jianxin Ma, Peng Cui, Xiao Wang, Wenwu Zhu, Hierarchical taxonomy [162] Yashen Wang, Yifeng Liu, Huanhuan Zhang, Haiyong Xie, Leveraging lexi-
aware network embedding, in: ACM SIGKDD KDD, 2018, pp. 1920–1929. cal semantic information for learning concept-based multiple embedding
[135] Hanie Sedghi, Ashish Sabharwal, Knowledge completion for generics representations for knowledge graph completion, in: APWeb and WAIM
using guided tensor factorization, Trans. Assoc. Comput. Linguist. 6 (2018) Joint International Conference on Web and Big Data, Springer, 2019, pp.
197–210. 382–397.
[136] Bahare Fatemi, Siamak Ravanbakhsh, David Poole, Improved knowledge [163] Wenpeng Yin, Yadollah Yaghoobzadeh, Hinrich Schütze, Recurrent one-
graph embedding using background taxonomic information, in: AAAI, Vol. hop predictions for reasoning over knowledge graphs, in: COLING, 2018,
33, 2019, pp. 3526–3533. pp. 2369–2378.
[137] Mikhail Belkin, Partha Niyogi, Laplacian eigenmaps and spectral tech- [164] Ni Lao, Tom Mitchell, William Cohen, Random walk inference and
niques for embedding and clustering, in: NIPS, Vol. 14, 2001, pp. learning in a large scale knowledge base, in: EMNLP, 2011, pp. 529–539.
585–591. [165] Matt Gardner, Tom Mitchell, Efficient and expressive knowledge base
[138] Sam T. Roweis, Lawrence K. Saul, Nonlinear dimensionality reduction by completion using subgraph feature extraction, in: EMNLP, 2015, pp.
locally linear embedding, Science 290 (5500) (2000) 2323–2326. 1488–1498.
[139] Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, Knowledge graph completion [166] Arvind Neelakantan, Benjamin Roth, Andrew McCallum, Compositional
with adaptive sparse transfer matrix, in: AAAI, Vol. 30, 2016. vector space models for knowledge base completion, in: ACL and the
[140] Zhiqiang Geng, Zhongkun Li, Yongming Han, A novel asymmetric em- IJCNLP (Volume 1: Long Papers), 2015, pp. 156–166.
bedding model for knowledge graph completion, in: ICPR, IEEE, 2018, [167] Rajarshi Das, Arvind Neelakantan, David Belanger, Andrew McCallum,
pp. 290–295. Chains of reasoning over entities, relations, and text using recurrent
[141] Muhao Chen, Yingtao Tian, Xuelu Chen, Zijun Xue, Carlo Zaniolo, On2vec: neural networks, in: EACL (1), 2017.
Embedding-based relation prediction for ontology population, in: SIAM [168] Xiaotian Jiang, Quan Wang, Baoyuan Qi, Yongqin Qiu, Peng Li, Bin Wang,
ICDM, SIAM, 2018, pp. 315–323. Attentive path combination for knowledge graph completion, in: ACML,
PMLR, 2017, pp. 590–605.
[142] Ryo Takahashi, Ran Tian, Kentaro Inui, Interpretable and compositional
relation learning by joint training with an autoencoder, in: ACL (Volume [169] Yelong Shen, Po-Sen Huang, Ming-Wei Chang, Jianfeng Gao, Modeling
1: Long Papers), 2018, pp. 2148–2159. large-scale structured relationships with shared memory for knowledge
base completion, in: Workshop on Representation Learning for NLP, 2017,
[143] Kelvin Guu, John Miller, Percy Liang, Traversing knowledge graphs in
pp. 57–68.
vector space, 2015, arXiv preprint arXiv:1506.01094.
[170] Kai Lei, Jin Zhang, Yuexiang Xie, Desi Wen, Daoyuan Chen, Min Yang,
[144] Atsushi Suzuki, Yosuke Enokida, Kenji Yamanishi, Riemannian TransE:
Ying Shen, Path-based reasoning with constrained type attention for
Multi-relational graph embedding in non-euclidean space, 2018.
knowledge graph completion, Neural Comput. Appl. (2019) 1–10.
[145] Zili Zhou, Shaowu Liu, Guandong Xu, Wu Zhang, On completing sparse
[171] Kristina Toutanova, Xi Victoria Lin, Wen-tau Yih, Hoifung Poon, Chris
knowledge base with transitive relation embedding, in: AAAI, Vol. 33,
Quirk, Compositional learning of embeddings for relation paths in
2019, pp. 3125–3132.
knowledge base and text, in: ACL (Volume 1: Long Papers), 2016, pp.
[146] Charalampos E. Tsourakakis, Fast counting of triangles in large real
1434–1444.
networks without counting: Algorithms and laws, in: ICDM, IEEE, 2008,
[172] Xixun Lin, Yanchun Liang, Fausto Giunchiglia, Xiaoyue Feng, Renchu Guan,
pp. 608–617.
Relation path embedding in knowledge graphs, Neural Comput. Appl. 31
[147] Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, Mark Johnson, Neighborhood
(9) (2019) 5629–5639.
mixture model for knowledge base completion, 2016, arXiv preprint
[173] Vivi Nastase, Bhushan Kotnis, Abstract graphs and abstract paths for
arXiv:1606.06461.
knowledge graph completion, in: SEM, 2019, pp. 147–157.
[148] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero,
[174] Yao Zhu, Hongzhi Liu, Zhonghai Wu, Yang Song, Tao Zhang, Repre-
Pietro Lio, Yoshua Bengio, Graph attention networks, 2017, arXiv preprint
sentation learning with ordered relation paths for knowledge graph
arXiv:1710.10903.
completion, 2019, arXiv preprint arXiv:1909.11864.
[149] Fanshuang Kong, Richong Zhang, Yongyi Mao, Ting Deng, Lena: Locality- [175] Batselem Jagvaral, Wan-Kon Lee, Jae-Seung Roh, Min-Sung Kim, Young-
expanded neural embedding for knowledge base completion, in: AAAI, Tack Park, Path-based reasoning approach for knowledge graph comple-
Vol. 33, 2019, pp. 2895–2902. tion using CNN-BiLSTM with attention mechanism, Expert Syst. Appl. 142
[150] Trapit Bansal, Da-Cheng Juan, Sujith Ravi, Andrew McCallum, A2n: At- (2020) 112960.
tending to neighbors for knowledge graph inference, in: ACL, 2019, pp. [176] Tao Zhou, Jie Ren, Matúš Medo, Yi-Cheng Zhang, Bipartite network
4387–4392. projection and personal recommendation, Phys. Rev. E 76 (4) (2007)
[151] Peifeng Wang, Jialong Han, Chenliang Li, Rong Pan, Logic attention based 046115.
neighborhood aggregation for inductive knowledge graph embedding, in: [177] Matt Gardner, Partha Talukdar, Bryan Kisiel, Tom Mitchell, Improving
AAAI, Vol. 33, 2019, pp. 7152–7159. learning and inference in a large knowledge-base using latent syntactic
[152] Weidong Li, Xinyu Zhang, Yaqian Wang, Zhihuan Yan, Rong Peng, cues, in: EMNLP, 2013, pp. 833–838.
Graph2Seq: Fusion embedding learning for knowledge graph completion, [178] Matt Gardner, Partha Talukdar, Jayant Krishnamurthy, Tom Mitchell,
IEEE Access 7 (2019) 157960–157971. Incorporating vector space similarity in random walk inference over
[153] Zhao Zhang, Fuzhen Zhuang, Hengshu Zhu, Zhiping Shi, Hui Xiong, knowledge bases, in: EMNLP, 2014, pp. 397–406.
Qing He, Relational graph neural network with hierarchical attention for [179] Paul J. Werbos, Backpropagation through time: what it does and how to
knowledge graph completion, in: Proceedings of the AAAI Conference on do it, Proc. IEEE 78 (10) (1990) 1550–1560.
Artificial Intelligence, Vol. 34, 2020, pp. 9612–9619. [180] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bah-
[154] Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, William Yang Wang, danau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, Learning phrase
One-shot relational learning for knowledge graphs, 2018, arXiv preprint representations using RNN encoder-decoder for statistical machine
arXiv:1808.09040. translation, in: EMNLP, 2014.
[155] Jiatao Zhang, Tianxing Wu, Guilin Qi, Gaussian metric learning for few- [181] Dave Orr, Amar Subramanya, Evgeniy Gabrilovich, Michael Ringgaard,
shot uncertain knowledge graph completion, in: International Conference billion clues in 800 million documents: A web research corpus annotated
on Database Systems for Advanced Applications, Springer, 2021, pp. with freebase concepts. Google Research Blog, 11.
256–271. [182] Yadollah Yaghoobzadeh, Hinrich Schütze, Corpus-level fine-grained entity
[156] Sébastien Ferré, Link prediction in knowledge graphs with concepts of typing using contextual information, 2016, arXiv preprint arXiv:1606.
nearest neighbours, in: ESWC, Springer, 2019, pp. 84–100. 07901.
[157] Agustín Borrego, Daniel Ayala, Inma Hernández, Carlos R. Rivero, David [183] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Ruiz, CAFE: Knowledge graph completion using neighborhood-aware Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention is all you need,
features, Eng. Appl. Artif. Intell. 103 (2021) 104302. in: NIPS, 2017.
[158] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus, [184] Ali Sadeghian, Mohammadreza Armandpour, Patrick Ding, Daisy Zhe
End-to-end memory networks, 2015, arXiv preprint arXiv:1503.08895. Wang, Drum: End-to-end differentiable rule mining on knowledge graphs,
[159] Sébastien Ferré, Concepts de plus proches voisins dans des graphes 2019, arXiv preprint arXiv:1911.00055.
de connaissances, in: 28es Journées Francophones d’Ingénierie des [185] Quan Wang, Bin Wang, Li Guo, Knowledge base completion using
Connaissances IC 2017, 2017, pp. 163–174. embeddings and rules, in: IJCAI, 2015.

62
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

[186] Shangpu Jiang, Daniel Lowd, Dejing Dou, Learning to refine an automati- [214] Daphne Koller, Nir Friedman, Sašo Džeroski, Charles Sutton, Andrew
cally extracted knowledge base using markov logic, in: ICDM, IEEE, 2012, McCallum, Avi Pfeffer, Pieter Abbeel, Ming-Fai Wong, Chris Meek, Jennifer
pp. 912–917. Neville, et al., Introduction to Statistical Relational Learning, MIT Press,
[187] Jay Pujara, Hui Miao, Lise Getoor, William W. Cohen, Ontology-aware 2007.
partitioning for knowledge graph identification, in: AKBC Workshop, [215] Matthew Richardson, Pedro Domingos, Markov logic networks, Mach.
2013, pp. 19–24. Learn. 62 (1–2) (2006) 107–136.
[188] Gustav Sourek, Vojtech Aschenbrenner, Filip Zelezny, Steven Schockaert, [216] William Yang Wang, Kathryn Mazaitis, William W. Cohen, Programming
Ondrej Kuzelka, Lifted relational neural networks: Efficient learning of with personalized pagerank: a locally groundable first-order probabilistic
latent relational structures, J. Artificial Intelligence Res. 62 (2018) 69–100. logic, in: CIKM, 2013, pp. 2129–2138.
[189] Ondřej Kuželka, Jesse Davis, Markov logic networks for knowledge base [217] Arvind Neelakantan, Quoc V. Le, Martin Abadi, Andrew McCallum, Dario
completion: A theoretical analysis under the MCAR assumption, in: UAI, Amodei, Learning a natural language interface with neural programmer,
PMLR, 2020, pp. 1138–1148. 2016, arXiv preprint arXiv:1611.08945.
[190] Yuyu Zhang, Xinshi Chen, Yuan Yang, Arun Ramamurthy, Bo Li, Yuan [218] Arvind Neelakantan, Quoc V. Le, Ilya Sutskever, Neural programmer:
Qi, Le Song, Efficient probabilistic logic reasoning with graph neural Inducing latent programs with gradient descent, 2015, arXiv preprint
networks, 2020, arXiv preprint arXiv:2001.11850. arXiv:1511.04834.
[191] Fan Yang, Zhilin Yang, William W. Cohen, Differentiable learning of [219] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein, Learning to
logical rules for knowledge base reasoning, 2017, arXiv preprint arXiv: compose neural networks for question answering, 2016, arXiv preprint
1702.08367. arXiv:1601.01705.
[192] Pouya Ghiasnezhad Omran, Kewen Wang, Zhe Wang, Scalable rule [220] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Dani-
learning via learning representation, in: IJCAI, 2018, pp. 2149–2155. helka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward
[193] Tim Rocktäschel, Deep prolog: End-to-end differentiable proving in Grefenstette, Tiago Ramalho, John Agapiou, et al., Hybrid computing using
knowledge bases, in: AITP 2017, 2017, p. 9. a neural network with dynamic external memory, Nature 538 (7626)
[194] Pasquale Minervini, Matko Bosnjak, Tim Rocktäschel, Sebastian Riedel, (2016) 471–476.
Towards neural theorem proving at scale, 2018, arXiv preprint arXiv: [221] William W. Cohen, Tensorlog: A differentiable deductive database, 2016,
1807.08204. arXiv preprint arXiv:1605.06523.
[195] Zhuoyu Wei, Jun Zhao, Kang Liu, Zhenyu Qi, Zhengya Sun, Guanhua Tian, [222] Leonid Boytsov, Bilegsaikhan Naidan, Engineering efficient and effective
Large-scale knowledge base completion: Inferring via grounding network non-metric space library, in: SISAP, Springer, 2013, pp. 280–293.
sampling over selected instances, in: CIKM, 2015, pp. 1331–1340. [223] Yu A. Malkov, Dmitry A. Yashunin, Efficient and robust approximate
[196] William Yang Wang, William W. Cohen, Learning first-order logic nearest neighbor search using hierarchical navigable small world graphs,
embeddings via matrix factorization, in: IJCAI, 2016, pp. 2132–2138. PAMI 42 (4) (2018) 824–836.
[224] Richard Evans, Edward Grefenstette, Learning explanatory rules from
[197] Shu Guo, Quan Wang, Lihong Wang, Bin Wang, Li Guo, Jointly embedding
noisy data, J. Artificial Intelligence Res. 61 (2018) 1–64.
knowledge graphs and logical rules, in: EMNLP, 2016, pp. 192–202.
[225] Guillaume Bouchard, Sameer Singh, Theo Trouillon, On approximate rea-
[198] Pengwei Wang, Dejing Dou, Fangzhao Wu, Nisansa de Silva, Lianwen Jin,
soning capabilities of low-rank vector spaces, in: AAAI Spring Symposia,
Logic rules powered knowledge graph embedding, 2019, arXiv preprint
Citeseer, 2015.
arXiv:1903.03772.
[226] Baoxu Shi, Tim Weninger, ProjE: Embedding projection for knowledge
[199] Shu Guo, Quan Wang, Lihong Wang, Bin Wang, Li Guo, Knowledge graph
graph completion, in: AAAI, Vol. 31, 2017.
embedding with iterative guidance from soft rules, in: AAAI, Vol. 32,
[227] Han Xiao, Minlie Huang, Lian Meng, Xiaoyan Zhu, SSP: semantic space
2018.
projection for knowledge graph embedding with text descriptions, in:
[200] Vinh Thinh Ho, Daria Stepanova, Mohamed Hassan Gad-Elrab, Evgeny
AAAI, Vol. 31, 2017.
Kharlamov, Gerhard Weikum, Learning rules from incomplete kgs using
[228] Xu Han, Zhiyuan Liu, Maosong Sun, Neural knowledge acquisition via
embeddings, in: ISWC, ceur. ws. org, 2018.
mutual attention between knowledge graph and text, in: AAAI, Vol. 32,
[201] Wen Zhang, Bibek Paudel, Liang Wang, Jiaoyan Chen, Hai Zhu, Wei Zhang,
2018.
Abraham Bernstein, Huajun Chen, Iteratively learning embeddings and
[229] Paolo Rosso, Dingqi Yang, Philippe Cudré-Mauroux, Revisiting text and
rules for knowledge graph reasoning, in: WWW, 2019, pp. 2366–2377.
knowledge graph joint embeddings: The amount of shared information
[202] Meng Qu, Jian Tang, Probabilistic logic neural networks for reasoning,
matters!, in: 2019 IEEE Big Data, IEEE, 2019, pp. 2465–2473.
2019, arXiv preprint arXiv:1906.08495.
[230] Bo An, Bo Chen, Xianpei Han, Le Sun, Accurate text-enhanced knowledge
[203] Jianfeng Du, Jeff Z. Pan, Sylvia Wang, Kunxun Qi, Yuming Shen, Yu Deng, graph representation learning, in: NAACL: Human Language Technologies,
Validation of growing knowledge graphs by abductive text evidences, in: Volume 1 (Long Papers), 2018, pp. 745–755.
AAAI, Vol. 33, 2019, pp. 2784–2791. [231] Teng Long, Ryan Lowe, Jackie Chi Kit Cheung, Doina Precup, Leveraging
[204] Christian Meilicke, Melisachew Wudage Chekol, Daniel Ruffinelli, Heiner lexical resources for learning entity embeddings in multi-relational data,
Stuckenschmidt, Anytime bottom-up rule learning for knowledge graph in: ACL (2), 2016.
completion, in: IJCAI, 2019, pp. 3137–3143. [232] Miao Fan, Qiang Zhou, Thomas Fang Zheng, Ralph Grishman, Distributed
[205] Jiangtao Ma, Yaqiong Qiao, Guangwu Hu, Yanjun Wang, Chaoqin Zhang, representation learning for knowledge graphs with entity descriptions,
Yongzhong Huang, Arun Kumar Sangaiah, Huaiguang Wu, Hongpo Zhang, Pattern Recognit. Lett. 93 (2017) 31–37.
Kai Ren, ELPKG: A high-accuracy link prediction approach for knowledge [233] Jiacheng Xu, Xipeng Qiu, Kan Chen, Xuanjing Huang, Knowledge graph
graph completion, Symmetry 11 (9) (2019) 1096. representation with jointly structural and textual encoding, in: IJCAI,
[206] Guanglin Niu, Yongfei Zhang, Bo Li, Peng Cui, Si Liu, Jingyang Li, Xiaowei 2017.
Zhang, Rule-guided compositional representation learning on knowledge [234] Michael Cochez, Martina Garofalo, Jérôme Lenßen, Maria Angela Pelle-
graphs, in: AAAI, Vol. 34, 2020, pp. 2950–2958. grino, A first experiment on including text literals in KGloVe, 2018, arXiv
[207] Luis Galárraga, Christina Teflioudi, Katja Hose, Fabian M. Suchanek, Fast preprint arXiv:1807.11761.
rule mining in ontological knowledge bases with AMIE +, VLDB J. 24 (6) [235] Nada Mimouni, Jean-Claude Moissinac, Anh Vu, Knowledge base com-
(2015) 707–730. pletion with analogical inference on context graphs, in: Semapro 2019,
[208] Christian Meilicke, Manuel Fink, Yanjie Wang, Daniel Ruffinelli, Rainer 2019.
Gemulla, Heiner Stuckenschmidt, Fine-grained evaluation of rule-and [236] Liang Yao, Chengsheng Mao, Yuan Luo, KG-BERT: BERT for knowledge
embedding-based systems for knowledge graph completion, in: ISWC, graph completion, 2019, arXiv preprint arXiv:1909.03193.
Springer, 2018, pp. 3–20. [237] Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, Jian
[209] Yang Chen, Sean Goldberg, Daisy Zhe Wang, Soumitra Siddharth Johri, Tang, KEPLER: A unified model for knowledge embedding and pre-trained
Ontological pathfinding, in: International Conference on Management of language representation, 2019, arXiv preprint arXiv:1911.06136.
Data, 2016, pp. 835–846. [238] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
[210] Tim Rocktäschel, Matko Bosnjak, Sameer Singh, Sebastian Riedel, Low- Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, Roberta:
dimensional embeddings of logic, in: ACL 2014 Workshop on Semantic A robustly optimized bert pretraining approach, 2019, arXiv preprint
Parsing, 2014, pp. 45–49. arXiv:1907.11692.
[211] Tim Rocktäschel, Sameer Singh, Sebastian Riedel, Injecting logical back- [239] Daniel Daza, Michael Cochez, Paul Groth, Inductive entity representations
ground knowledge into embeddings for relation extraction, in: NAACL: from text via link prediction, in: Proceedings of the Web Conference 2021,
Human Language Technologies, 2015, pp. 1119–1129. 2021, pp. 798–808.
[212] Stephen Muggleton, et al., Stochastic logic programs, in: Advances in [240] Bo Wang, Tao Shen, Guodong Long, Tianyi Zhou, Ying Wang, Yi Chang,
Inductive Logic Programming, Vol. 32, Citeseer, 1996, pp. 254–264. Structure-augmented text representation learning for efficient knowledge
[213] Stephen Muggleton, Inductive Logic Programming, Vol. 38, Morgan graph completion, in: Proceedings of the Web Conference 2021, 2021, pp.
Kaufmann, 1992. 1737–1748.

63
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

[241] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christo- [267] Farzaneh Mahdisoltani, Joanna Biega, Fabian Suchanek, Yago3: A knowl-
pher Clark, Kenton Lee, Luke Zettlemoyer, Deep contextualized word edge base from multilingual wikipedias, in: CIDR, CIDR Conference,
representations, 2018, arXiv preprint arXiv:1802.05365. 2014.
[242] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, [268] Itsumi Saito, Kyosuke Nishida, Hisako Asano, Junji Tomita, Common-
Improving language understanding by generative pre-training, 2018. sense knowledge base completion and generation, in: CoNLL, 2018, pp.
[243] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Bert: Pre- 141–150.
training of deep bidirectional transformers for language understanding, [269] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv
2018, arXiv preprint arXiv:1810.04805. Batra, C. Lawrence Zitnick, Devi Parikh, Vqa: Visual question answering,
[244] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhut- in: ICCV, 2015, pp. 2425–2433.
dinov, Quoc V. Le, Xlnet: Generalized autoregressive pretraining for [270] Andrej Karpathy, Li Fei-Fei, Deep visual-semantic alignments for
language understanding, 2019, arXiv preprint arXiv:1906.08237. generating image descriptions, in: CVPR, 2015, pp. 3128–3137.
[245] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean, Dis- [271] Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, Jason We-
tributed representations of words and phrases and their compositionality, ston, Engaging image captioning via personality, in: IEEE/CVF CVPR, 2019,
2013, arXiv preprint arXiv:1310.4546. pp. 12516–12526.
[246] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, [272] Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, Richard
Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention is all you need, Socher, Explain yourself! leveraging language models for commonsense
2017, arXiv preprint arXiv:1706.03762. reasoning, 2019, arXiv preprint arXiv:1906.02361.
[247] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, Qun [273] Pengcheng Yang, Lei Li, Fuli Luo, Tianyu Liu, Xu Sun, Enhancing topic-to-
Liu, ERNIE: Enhanced language representation with informative entities, essay generation with external commonsense knowledge, in: ACL, 2019,
in: Proceedings of the 57th Annual Meeting of the Association for pp. 2002–2012.
Computational Linguistics, 2019, pp. 1441–1451. [274] Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, Min-
[248] Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuan-Jing lie Huang, Augmenting end-to-end dialogue systems with commonsense
Huang, Zheng Zhang, CoLAKE: Contextualized language and knowledge knowledge, in: AAAI, Vol. 32, 2018.
embedding, in: Proceedings of the 28th International Conference on [275] Xiang Li, Aynaz Taheri, Lifu Tu, Kevin Gimpel, Commonsense knowledge
Computational Linguistics, 2020, pp. 3660–3670. base completion, in: ACL (Volume 1: Long Papers), 2016, pp. 1445–1455.
[249] Boran Hao, Henghui Zhu, Ioannis Paschalidis, Enhancing clinical bert [276] Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli
embedding using a biomedical knowledge base, in: Proceedings of the Celikyilmaz, Yejin Choi, Comet: Commonsense transformers for automatic
28th International Conference on Computational Linguistics, 2020, pp. knowledge graph construction, 2019, arXiv preprint arXiv:1906.05317.
657–661. [277] Chaitanya Malaviya, Chandra Bhagavatula, Antoine Bosselut, Yejin Choi,
[250] Robert Logan, Nelson F. Liu, Matthew E. Peters, Matt Gardner, Sameer Commonsense knowledge base completion with structural and semantic
Singh, Barack’s wife hillary: Using knowledge graphs for fact-aware context, in: AAAI, Vol. 34, 2020, pp. 2925–2933.
language modeling, in: ACL, 2019, pp. 5962–5971. [278] Xuelu Chen, Muhao Chen, Weijia Shi, Yizhou Sun, Carlo Zaniolo, Em-
bedding uncertain knowledge graphs, in: AAAI, Vol. 33, 2019, pp.
[251] Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur
3363–3370.
Joshi, Sameer Singh, Noah A. Smith, Knowledge enhanced contextual
word representations, in: EMNLP-IJCNLP, 2019, pp. 43–54. [279] Yohan Chalier, Simon Razniewski, Gerhard Weikum, Joint reasoning for
multi-faceted commonsense knowledge, 2020, arXiv preprint arXiv:2001.
[252] Tingsong Jiang, Tianyu Liu, Tao Ge, Lei Sha, Sujian Li, Baobao Chang, Zhi-
04170.
fang Sui, Encoding temporal information for time-aware link prediction,
[280] Wentao Wu, Hongsong Li, Haixun Wang, Kenny Q. Zhu, Probase: A prob-
in: EMNLP, 2016, pp. 2350–2354.
abilistic taxonomy for text understanding, in: ACM SIGMOD International
[253] Rishab Goel, Seyed Mehran Kazemi, Marcus Brubaker, Pascal Poupart,
Conference on Management of Data, 2012, pp. 481–492.
Diachronic embedding for temporal knowledge graph completion, in:
[281] Robyn Speer, Joshua Chin, Catherine Havasi, Conceptnet 5.5: An open
AAAI, Vol. 34, 2020, pp. 3988–3995.
multilingual graph of general knowledge, in: AAAI, Vol. 31, 2017.
[254] Chenjin Xu, Mojtaba Nayyeri, Fouad Alkhoury, Hamed Yazdi, Jens
[282] Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bishan
Lehmann, Temporal knowledge graph completion based on time series
Yang, Justin Betteridge, Andrew Carlson, Bhanava Dalvi, Matt Gardner,
Gaussian embedding, in: ISWC, Springer, 2020, pp. 654–671.
Bryan Kisiel, et al., Never-ending learning, Commun. ACM 61 (5) (2018)
[255] Julien Leblay, Melisachew Wudage Chekol, Deriving validity time in
103–115.
knowledge graph, in: Companion Proceedings of the the Web Conference
[283] Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula,
2018, 2018, pp. 1771–1776.
Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, Yejin Choi,
[256] Shib Sankar Dasgupta, Swayambhu Nath Ray, Partha Talukdar, Hyte:
Atomic: An atlas of machine commonsense for if-then reasoning, in: AAAI,
Hyperplane-based temporally aware knowledge graph embedding, in:
Vol. 33, 2019, pp. 3027–3035.
EMNLP, 2018, pp. 2001–2011.
[284] Robert Speer, Catherine Havasi, ConceptNet 5: A large semantic network
[257] Yunpu Ma, Volker Tresp, Erik A. Daxberger, Embedding models for for relational knowledge, in: The People’s Web Meets NLP, Springer, 2013,
episodic knowledge graphs, J. Web Semant. 59 (2019) 100490. pp. 161–176.
[258] Alberto García-Durán, Sebastijan Dumančić, Mathias Niepert, Learning [285] Jastrzebski Stanislaw, Dzmitry Bahdanau, Seyedarian Hosseini, Michael
sequence encoders for temporal knowledge graph completion, 2018, arXiv Noukhovitch, Yoshua Bengio, Jackie Chi Kit Cheung, Commonsense mining
preprint arXiv:1809.03202. as knowledge base completion? A study on the impact of novelty, 2018,
[259] Timothée Lacroix, Guillaume Obozinski, Nicolas Usunier, Tensor decom- arXiv preprint arXiv:1804.09259.
positions for temporal knowledge base completion, 2020, arXiv preprint [286] Hugo Liu, Push Singh, ConceptNet—a practical commonsense reasoning
arXiv:2004.04926. tool-kit, BT Technol. J. 22 (4) (2004) 211–226.
[260] Rakshit Trivedi, Hanjun Dai, Yichen Wang, Le Song, Know-evolve: Deep [287] Damian Szklarczyk, John H. Morris, Helen Cook, Michael Kuhn, Ste-
temporal reasoning for dynamic knowledge graphs, in: ICML, PMLR, 2017, fan Wyder, Milan Simonovic, Alberto Santos, Nadezhda T. Doncheva,
pp. 3462–3471. Alexander Roth, Peer Bork, et al., The STRING database in 2017: quality-
[261] Woojeong Jin, He Jiang, Meng Qu, Tong Chen, Changlin Zhang, Pedro controlled protein–protein association networks, made broadly accessible,
Szekely, Xiang Ren, Recurrent event network: Global structure inference Nucleic Acids Res. (2016) gkw937.
over temporal knowledge graph, 2019, arXiv preprint arXiv:1904.05530. [288] Push Singh, Thomas Lin, Erik T. Mueller, Grace Lim, Travell Perkins,
[262] Zhen Han, Yuyi Wang, Yunpu Ma, Stephan Guünnemann, Volker Tresp, Wan Li Zhu, Open mind common sense: Knowledge acquisition from the
The graph hawkes network for reasoning on temporal knowledge graphs, general public, in: OTM Confederated International Conferences" on the
2020, arXiv preprint arXiv:2003.13432. Move to Meaningful Internet Systems", Springer, 2002, pp. 1223–1237.
[263] Jiapeng Wu, Meng Cao, Jackie Chi Kit Cheung, William L Hamilton, TeMP [289] Gabor Angeli, Christopher D. Manning, Philosophers are mortal: Inferring
Temporal Message Passing for Temporal Knowledge Graph Completion, the truth of unseen facts, in: CoNLL, 2013, pp. 133–142.
2020, arXiv preprint arXiv:2010.03526. [290] Jonathan Gordon, Benjamin Van Durme, Reporting bias and knowledge
[264] Michael D Ward, Andreas Beger, Josh Cutler, Matthew Dickenson, Cassy acquisition, in: Workshop on AKBC, 2013, pp. 25–30.
Dorff, Ben Radford, Comparing GDELT and ICEWS event data, Analysis 21 [291] Jianfeng Wen, Jianxin Li, Yongyi Mao, Shini Chen, Richong Zhang, On
(1) (2013) 267–297. the representation and embedding of knowledge bases beyond binary
[265] Aaron Schein, John Paisley, David M. Blei, Hanna Wallach, Bayesian pois- relations, in: IJCAI, 2016, pp. 1300–1307.
son tensor factorization for inferring multilateral relations from sparse [292] Saiping Guan, Xiaolong Jin, Yuanzhuo Wang, Xueqi Cheng, Link prediction
dyadic event counts, in: ACM SIGKDD KDD, 2015, pp. 1045–1054. on n-ary relational data, in: WWW, 2019, pp. 583–593.
[266] Johannes Hoffart, Fabian M Suchanek, Klaus Berberich, Gerhard Weikum, [293] Michalis Faloutsos, Petros Faloutsos, Christos Faloutsos, On power-law
YAGO2: A spatially and temporally enhanced knowledge base from relationships of the internet topology, in: The Structure and Dynamics of
Wikipedia, Artificial Intelligence 194 (2013) 28–61. Networks, Princeton University Press, 2011, pp. 195–206.

64
T. Shen, F. Zhang and J. Cheng Knowledge-Based Systems 255 (2022) 109597

[294] Mark Steyvers, Joshua B. Tenenbaum, The large-scale structure of seman- [300] Romain Paulus, Caiming Xiong, Richard Socher, A deep reinforced model
tic networks: Statistical analyses and a model of semantic growth, Cogn. for abstractive summarization, 2017, arXiv preprint arXiv:1705.04304.
Sci. 29 (1) (2005) 41–78. [301] Kelvin Guu, Panupong Pasupat, Evan Zheran Liu, Percy Liang, From
[295] Prodromos Kolyvakis, Alexandros Kalousis, Dimitris Kiritsis, Hyperkg: language to programs: Bridging reinforcement learning and maximum
Hyperbolic knowledge graph embeddings for knowledge base completion, marginal likelihood, 2017, arXiv preprint arXiv:1704.07926.
2019, arXiv preprint arXiv:1908.04895. [302] Peng Lin, Qi Song, Yinghui Wu, Fact checking in knowledge graphs with
[296] Maximillian Nickel, Douwe Kiela, Learning continuous hierarchies in ontological subgraph patterns, Data Sci. Eng. 3 (4) (2018) 341–358.
the lorentz model of hyperbolic geometry, in: ICML, PMLR, 2018, pp. [303] Wenhan Xiong, Thien Hoang, William Yang Wang, Deeppath: A rein-
3779–3788. forcement learning method for knowledge graph reasoning, 2017, arXiv
[297] Octavian Ganea, Gary Bécigneul, Thomas Hofmann, Hyperbolic entailment preprint arXiv:1707.06690.
cones for learning hierarchical embeddings, in: ICML, PMLR, 2018, pp. [304] Yelong Shen, Jianshu Chen, Po-Sen Huang, Yuqing Guo, Jianfeng Gao,
1646–1655. Reinforcewalk: Learning to walk in graph with monte carlo tree search,
[298] Frederic Sala, Chris De Sa, Albert Gu, Christopher Ré, Representa- 2018.
tion tradeoffs for hyperbolic embeddings, in: ICML, PMLR, 2018, pp. [305] Rich Caruana, Multitask learning, Mach. Learn. 28 (1) (1997) 41–75.
4460–4469.
[299] Sumit Chopra Marc’Aurelio Ranzato, Michael Auli, Wojciech Zaremba,
Sequence level training with recurrent neural networks, 2015, CoRR
abs/1511.06732.

65

You might also like