A Comprehensive Overview of Knowledge Graph Completion
Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

Article history: Received 24 March 2022; Received in revised form 24 July 2022; Accepted 3 August 2022; Available online 12 August 2022.

Keywords: Knowledge Graph Completion (KGC); Classification; Comparisons and analyses; Performance evaluation; Overview

Abstract

Knowledge Graphs (KGs) provide high-quality structured knowledge for various downstream knowledge-aware tasks (such as recommendation and intelligent question answering), with unique advantages in representing and managing massive knowledge. The quality and completeness of KGs largely determine the effectiveness of these downstream tasks. However, owing to the incomplete nature of KGs, a large amount of valuable knowledge is still missing from them. It is therefore necessary to improve existing KGs by supplementing the missing knowledge. Knowledge Graph Completion (KGC) is one of the popular technologies for such knowledge supplementation, and concern over KGC technologies has been growing accordingly. Recently, there have been many studies focusing on the KGC field. To serve as a helpful resource for researchers to grasp the main ideas and results of KGC studies, and to further highlight ongoing research in KGC, in this paper we provide an all-round, up-to-date overview of the current state of the art in KGC.

According to the information sources used by KGC methods, we divide the existing KGC methods into two main categories: KGC methods relying on structural information and KGC methods using other additional information. Each category is further subdivided at different granularities for summarization and comparison. Besides, KGC methods for KGs of special fields (including temporal KGC, commonsense KGC, and hyper-relational KGC) are also introduced. In particular, we discuss comparisons and analyses for each category in our overview. Finally, some discussions and directions for future research are provided.

© 2022 Elsevier B.V. All rights reserved.
1. Introduction

Knowledge Graphs (KGs) describe concepts, entities, and their relations in a structured triple form, providing a better ability to organize, manage, and understand the mass of information in the world [1]. In recent years, KGs have played an increasingly important role in many knowledge-aware tasks, and have especially brought vitality to intelligent question answering, information extraction, and other artificial intelligence tasks [1–3]. There are a number of large-scale KGs such as DBPedia [4], Freebase [5], WordNet [6], and YAGO [7] (as shown in Table 1), which have been widely exploited in many knowledge-aware applications. Facts in these KGs are generally represented in the form of a triple (subject, predicate, object), which can be regarded as the fundamental data structure of KGs and preserves their essential semantic information [8].

Although KGs are of great value in applications, they are still characterized by incompleteness, because a large amount of valuable knowledge exists only implicitly or is missing from the KGs [1]. Some data indicate that the deficiency rate of some common basic relations in current large KGs is more than 70% [9], while other less universal relations are even more lacking. Knowledge Graph Completion (KGC) aims to predict and replenish the missing parts of triples. As a popular KGC research direction, Knowledge Graph Embedding (KGE) (or Knowledge Graph Representation Learning) has been proposed and has quickly gained massive attention. KGE embeds KG components (e.g., entities and relations) into continuous vector spaces so as to simplify manipulation while preserving the inherent structure of the KG [10–15]. Recently, there have been many studies focusing on the KGC field. To facilitate research on the KGC task and follow the development of the field, more and more review articles have appeared to sort out and summarize recent KGC technologies.

Accordingly, several previous overviews of KGC techniques have been provided:
• Wang et al. [16] provide the review most relevant to KGC studies from 2012 to 2016. They first coarsely group KGE models according to their input data (the input data including facts only, or incorporating additional information; the additional information in [16] involves entity types, relation paths, textual descriptions, logical rules, and a slight mention of several other kinds of information, such as entity attributes and temporal information). Then they further make finer-grained categorizations based on the above information.
Table 2
Common KGC benchmarks and their attributes.

Benchmark   Entities   Relations   #Training   #Validation   #Test
WN11        38 696     11          112 581     2 609         10 544
WN18RR      40 493     11          86 835      3 034         3 134
FB13        75 043     13          316 232     5 908         23 733
FB15k       14 951     1 345       483 142     50 000        59 071
FB15k-237   14 541     237         272 115     17 535        20 466
Fig. 2. The data example in Freebase.

The generator is responsible for generating negative samples, while the discriminator can use translation models to obtain the vector representations of entities and relations; it then scores the generated negative triples and feeds the related information back to the generator to provide experience for its negative sample generation. Recently, a series of negative sampling techniques based on GANs have appeared (e.g., [29–31]), and relevant experiments have shown that this kind of method can obtain high-quality negative samples, which are conducive to classifying triples correctly during the training of a knowledge representation model.

(2) Ranking setting
In the link prediction (LP) task, the evaluation is carried out by performing head prediction or tail prediction on all test triples, and computing for each prediction how the target entity ranks against all the other ones. Generally, the model expects the target entity to yield the highest plausibility. When computing the predicted ranks, two different settings, the raw and filtered scenarios, are applied. Actually, a prediction may have more than one valid answer: taking tail prediction for (Barack Obama, parent, Natasha Obama) as an example, a KGC model may associate a higher score with Malia Obama than with Natasha Obama, i.e., there may exist another predicted fact that is already contained in the KG, e.g., (Barack Obama, parent, Malia Obama). Depending on whether such valid answers should be considered acceptable or not, two separate settings have been devised [18]:
• Raw setting: in this case, valid entities outscoring the target one are considered as mistakes [18]. Thus, for a test fact (h, r, t) in the testing set, the raw rank rank_h of the target head entity h is computed as follows (analogously for the tail entity):
rank_h = |{e ∈ E \ {h} : s(e, r, t) > s(h, r, t)}| + 1
• Filtered setting: in this case, valid entities outscoring the target one are not considered as mistakes [18]; they are filtered out when computing the rank. For the test fact (h, r, t), the filtered rank rank_h of the target head entity h is computed as (analogously for the tail entity):
rank_h = |{e ∈ E \ {h} : s(e, r, t) > s(h, r, t) ∧ (e, r, t) ∉ T}| + 1

2.3. Datasets and evaluation metrics

Here we introduce the most frequently used datasets for KGC (see Section 2.3.1) and several evaluation metrics for KGC (see Section 2.3.2).

2.3.1. Datasets

We describe the datasets mainly developed on two KGs, Freebase and WordNet, and report some of their important attributes in Table 2.
• Freebase: Freebase is a public KG whose content is contributed by users. Moreover, Freebase also extracts knowledge from open KGs as a supplement [26]. The fundamental data items in Freebase include ''Topic'', ''Type'', ''Domain'', ''Property'', and so on. We give a demonstration to illustrate the data in Freebase in Fig. 2: the topic Miyazaki Hayao is a cartoonist in the cartoon domain, but a director in the movie domain. It can be seen that Freebase is a database consisting of multiple domains expanded by topics, and the graph structure of every topic is controlled by its type and type properties. Typically, the Freebase subsets FB15k and FB13, as well as the improved FB15k-237 based on FB15k, are generally used as experimental benchmarks for evaluating KGC methods:
(1) FB15k: FB15k is created by selecting the subset of entities that are also involved in the Wikilinks database and that possess at least 100 mentions in Freebase [11]. In addition, FB15k removes reversed relations (a reversed relation such as '!/people/person/nationality' just reverses the head and tail compared to the relation '/people/person/nationality'). The entities that appear in the validation and test sets of FB15k also appear in its training set. Also, FB15k converts n-ary relations represented with reification into cliques of binary edges, which greatly affects the graph structure and semantics [18]. FB15k has 592,213 triples with 14,951 entities and 1,345 relations, which were randomly split as shown in Table 2.
(2) FB15k-237: FB15k-237 is a subset of FB15k built by Toutanova and Chen [32], which arose in response to the test leakage problem that FB15k suffers from due to the presence of near-identical or reversed relations. Under this background, FB15k-237 was built to be a more challenging dataset by first selecting the facts from FB15k involving the 401 largest relations and removing all equivalent or reverse relations. It was then ensured that none of the entities connected in the training set are also directly linked in the validation and testing sets, in order to filter away all trivial triples [18].
• WordNet [6]: WordNet is a large cognitive-linguistics-based KG ontology, which can also be regarded as an English dictionary knowledge base; its construction considers the alphabetic order of words and further forms a semantic web of English words. In WordNet, entities (called synsets) correspond to word senses, and relation types define the lexical relations between these senses. Besides, WordNet not only contains multiple types of lexical relations such as polysemy, category classification, synonymy, and antonymy, but also includes entity descriptions. Furthermore, various post-produced subset datasets have been extracted from WordNet, such as WN11, WN18, and WN18RR:
(1) WN11: it includes 11 relations and 38,696 entities. The training set, the validation set, and the test set of WN11 contain 112,581, 2,609, and 10,544 triples, respectively [11].
(2) WN18: it uses WordNet as a starting point and then iteratively filters out entities and relations with too few mentions [11,18]. Note that WN18 involves reversible relations.
(3) WN18RR: WN18RR is built by Dettmers et al. [33] for relieving the test leakage issue in WN18, i.e., test data being seen by models at training time. It is constructed by applying a pipeline similar to the one employed for FB15k-237 [32]. Recently, the authors acknowledged that 212 entities in the testing set do not appear in the training set, making it impossible to reasonably predict about 6.7% of the test facts.
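These benchmarks are commonly distributed as plain-text triple files. The following is a minimal sketch (not tied to any official loader) that assumes the widely used release format of one tab-separated (head, relation, tail) triple per line in files named train.txt, valid.txt, and test.txt; the file names and layout are assumptions of this sketch.

```python
from pathlib import Path

def load_triples(path):
    """Read one (head, relation, tail) triple per tab-separated line."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            h, r, t = line.rstrip("\n").split("\t")
            triples.append((h, r, t))
    return triples

def load_benchmark(root):
    """Load the standard train/valid/test splits of a KGC benchmark."""
    root = Path(root)
    splits = {name: load_triples(root / f"{name}.txt")
              for name in ("train", "valid", "test")}
    entities = {e for tr in splits.values() for (h, _, t) in tr for e in (h, t)}
    relations = {r for tr in splits.values() for (_, r, _) in tr}
    return splits, sorted(entities), sorted(relations)

# Example (hypothetical directory name):
# splits, ents, rels = load_benchmark("FB15k-237")
```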
Table 3
Detailed computing formulas of evaluation metrics for KGC.

- MRR (LP, RP): MRR = (1/|Q|) Σ_{i=1}^{|Q|} 1/rank_i, where Q is the query set, |Q| the number of queries, and rank_i the rank of the first correct answer for the ith query.
- MR (LP, RP): MR = (1/|Q|) Σ_{i=1}^{|Q|} rank_i, with the same notation as MRR.
- Hits@n (LP, RP): Hits@n = (1/|Q|) Count(rank_i ≤ n), 0 < i ≤ |Q|, where Count() is the number of test examples whose rank falls within the top n.
- MAP (LP, RP): MAP = (1/|Q|) Σ_{q∈Q} AP_q, where AP_q is the average precision of query q.
- Accuracy (TC): Accuracy = (TP + TN)/(TP + TN + FP + FN), where TP, FP, TN, FN denote true positives, false positives, true negatives, and false negatives.
- Precision (TC): Precision = TP/(TP + FP).
- Recall (TC): Recall = TP/(TP + FN).
- F1 score (TC): F1 = 2 · Recall · Precision/(Recall + Precision).
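To make the ranking-based evaluation concrete, the following is a minimal NumPy sketch (illustrative only, not the authors' evaluation code) of how a filtered rank and the LP metrics from Table 3 can be computed; it assumes the candidate scores and the set of other known-true answers for a query are already available.

```python
import numpy as np

def filtered_rank(scores, target, known_true):
    """Filtered rank of `target`: candidates that form other known-true
    triples are removed before counting strictly higher-scored entities."""
    mask = np.ones(len(scores), dtype=bool)
    mask[list(known_true - {target})] = False   # filter other valid answers
    return int(np.sum(scores[mask] > scores[target])) + 1

def lp_metrics(ranks, ns=(1, 3, 10)):
    """MR, MRR and Hits@n from a list of per-query ranks (cf. Table 3)."""
    ranks = np.asarray(ranks, dtype=float)
    out = {"MR": float(ranks.mean()), "MRR": float((1.0 / ranks).mean())}
    for n in ns:
        out[f"Hits@{n}"] = float((ranks <= n).mean())
    return out

# Example: 4 candidate entities, target index 2, and entity 0 also
# forming another known-true triple for this query.
scores = np.array([0.9, 0.1, 0.7, 0.3])
r = filtered_rank(scores, target=2, known_true={0, 2})
print(r, lp_metrics([r, 1, 5]))
```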
2.3.2. Evaluation metrics

In this section, we describe the evaluation metrics generally used in KGC. Table 3 shows the detailed computing formulas of these metrics.

Mean Reciprocal Rank (MRR): MRR is widely used in ranking problems that tend to return multiple results, such as the LP and RP tasks in KGC. When dealing with such problems, the evaluation system ranks the results by their scores from high to low. MRR evaluates a ranking algorithm according to its ranking of the target answer: the higher the target answer ranks, the better the ranking algorithm. In a formulaic view, for a query whose target answer ranks nth, the reciprocal-rank score is 1/n (if there is no target answer among the returned results, the score is 0).

Mean Rank (MR) and Hits@n: Similar to MRR and generally used in top-K ranking problems, MR and Hits@n are common metrics in KGC evaluation, especially in the LP and RP tasks. MR represents the average rank of the target entities (or relations) over the testing set; Hits@n (usually n = 1, 3, 10) indicates the proportion of test cases whose predicted target entity (or relation) ranks in the top n. The ranks are computed according to each prediction's score.

Accuracy: Accuracy refers to the ratio of correctly predicted triples to the total number of predicted triples; it is usually applied to evaluate the quality of classification models in the TC task for KGC, and its calculation formula is given in Table 3.

Other evaluation metrics: There are other evaluation metrics for KGC tasks, such as Mean Average Precision (MAP), which pays attention to the relevance of the returned results in ranking problems, and metrics closely related to ''accuracy'' for measuring classification problems, like ''recall'', ''precision'', and ''F1 score''. Compared with MR, MRR, Hits@n, and ''accuracy'', these metrics are less frequently employed in the field of KGC. The detailed computing formulas of all these metrics can be found in Table 3.

2.4. Knowledge Graph Refinement (KGR) vs. KGC

The construction process of large-scale KGs means that the formalized knowledge in KGs cannot reasonably reach both ''full coverage'' and ''full correctness'' simultaneously; KGs usually need a good trade-off between completeness and correctness. Knowledge Graph Refinement (KGR) is proposed to infer and add missing knowledge to the graph (i.e., KGC) and to identify erroneous pieces of information (i.e., error detection) [24]. Recently, KGR has been incorporated into recommender systems [34]: Tu et al. [34] exploit the KG to capture target-specific knowledge relationships in recommender systems by distilling the KG to retain the useful information and refining the knowledge to capture users' preferences.

Basically, KGC is one of the KGR subtasks, conducting inference and prediction of missing triples. Error detection (e.g., [35,36]) is another KGR subtask for identifying errors in KGs. Jia et al. [36] establish a knowledge graph triple trustworthiness measurement model that quantifies the semantic correctness of triples and the degree to which they express true facts. Note, however, that KGC is a relatively independent task aiming to increase the coverage of KGs and alleviate their incompleteness. In the current overview, we focus on KGC techniques; for issues about KGR, readers can refer to [24,34].

2.5. Our categorization principle

The main full-view categorization of our review on KGC studies is shown in Fig. 3.

To follow the rapid development of KGC models, we provide wide coverage of emerging research on advanced KGC technologies. We include the main literature since the beginning of KGC research as comprehensively as possible and treat the far-reaching and remarkable approaches in detail. We divide KGC methods into two main categories according to whether they use additional information: Structure (triple) information-based KGC methods and Additional information-based KGC methods (the additional information typically refers to information included inside or outside of KGs other than the structure information, such as textual descriptions and artificial rules). Moreover, we further consider the source of the additional information: depending on whether it comes from inside the KG, we classify the additional information into two finer subclasses, internal side information inside KGs and external extra information outside KGs. In addition, we introduce some KGC techniques targeting certain fields, like Temporal Knowledge Graph Completion (TKGC), CommonSense KGC (CSKGC), and
Hyper-relational KGC (HKGC). We also make a detailed comparison and summary among the methods of each small category, and we give a global discussion and prospects for future research directions of KGC. Specifically, our categorization principle is as follows:
• Structure information-based KGC methods: these only use the structure information of the internal facts in KGs. For this category, KGC is reviewed under semantic matching models and translation models according to the nature of their scoring functions. The semantic matching models generally use semantic matching-based scoring functions and further consist of tensor/matrix factorization models and neural network models. The translation models apply distance-based scoring functions.
• Additional information-based KGC methods: these cooperate with additional information (the inside or outside information of KGs other than the structure information) to achieve KGC. For this category, we further propose fine-grained taxonomies from two views concerning the usage of inside or outside information:
(1) Internal side information inside KGs, including node attribute information, entity-related information, relation-related information, neighborhood information, and relational path information;
(2) External extra information outside KGs, mainly including two aspects: rule-based KGC and third-party data source-based KGC.
• Other KGC technologies: we pay additional attention to some other KGC techniques, such as Temporal KGC, CommonSense KGC, and Hyper-relational KGC.

3. Structural information-based KGC technologies

In this section, we focus on KGC technologies relying on structure information only and give an account of the categories of methods belonging to this kind of KGC technology: Semantic Matching models in Section 3.1 and Translation models in Section 3.2.

3.1. Semantic matching models

Semantic Matching models are a kind of model that computes semantic matching-based scoring functions by measuring the semantic similarities of entity or relation embeddings in a latent embedding space. In this category, we introduce two subclasses: Tensor/Matrix Factorization Models (see Section 3.1.1) and Neural Network Models (see Section 3.1.2).
Table 4
Characteristics of Tensor Factorization (TF) KGC methods.

Tucker-based TF methods:
- TuckER [37] | Highlights: Tucker decomposition; multi-task learning | Score function: s(h, r, t) = W ×_1 v_h ×_2 v_r ×_3 v_t | Loss: Bernoulli L_log | Parameters & constraints: W ∈ R^{d_e×d_r×d_e}, v_h, v_t ∈ R^{d_e}, v_r ∈ R^{d_r}.

DEDICOM-based TF methods:
- RESCAL [13] | Highlights: three-way bilinear TF | Score function: s(h, r, t) = v_h^T M_r v_t | Loss: L2 | Parameters & constraints: v_h, v_t ∈ R^d, M_r ∈ R^{d×d}.
- LFM [38] | Highlights: bilinear TF; decomposes the relation matrix, decreasing the parameters of RESCAL | Score function: s(h, r, t) ≜ y^T M_r y′ + v_h^T M_r z + z′^T M_r v_t + v_h^T M_r v_t | Loss: L_log | Parameters & constraints: M_r = Σ_{j=1}^{p} α_r^j Θ_j with Θ_j = u_j v_j^T; y, y′, z, z′ ∈ R^d; u_j, v_j ∈ R^d; α_r ∈ R^p.
- Tatec [39] | Highlights: combined 2-way and 3-way interaction models; hard regularization; soft regularization | Score function: s(h, r, t) = s_1(h, r, t) + s_2(h, r, t), with s_1(h, r, t) = v_{r1}^T v_{h1} + v_{r2}^T v_{t1} + v_{h1}^T D v_{t1} and s_2(h, r, t) = v_{h2}^T M_r v_{t2} | Loss: L_marg + Δ_{soft/hard} | Parameters & constraints: v_{hi}, v_{ti} ∈ R^{d_i} (i = 1, 2), v_{r1}, v_{r2} ∈ R^{d_1}, M_r ∈ R^{d_2×d_2}, D a diagonal matrix.
- ANALOGY [40] | Highlights: bilinear TF; normality and commutativity constraints on the relation matrices | Score function: s(h, r, t) = v_h^T M_r v_t | Loss: L_logistic | Parameters & constraints: v_h, v_t ∈ R^d, M_r ∈ R^{d×d}, with M_r M_r^T = M_r^T M_r and M_r M_{r′} = M_{r′} M_r for all r, r′ ∈ R.
- REST [41] | Highlights: subgraph tensor building; RW-based SGS; predicate sparsification operator; Focused Link Prediction (FLP) | Score function: for a query (h, r, ?), v_e = v_h^T M_r A and s(h, r, t) = v_h^T M_r v_t | Loss: L2 | Parameters & constraints: v_h, v_t ∈ R^d, M_r ∈ R^{d×d}, A ∈ R^{N_e×d}.

CP-based TF methods:
- DistMult [42] | Highlights: RESCAL with diagonal relation matrices | Score function: s(h, r, t) = v_h^T M_r^{diag} v_t | Loss: max L_marg | Parameters & constraints: M_r^{diag} = diag(v_r), v_r ∈ R^d.
- ComplEx [43] | Highlights: complex-valued embeddings; CP-based TF model | Score function: s(h, r, t) = Re(v_h^T M_r^{diag} v̄_t) = Re(Σ_{i=0}^{d−1} [v_r]_i · [v_h]_i · [v̄_t]_i) | Loss: L_nll + Δ_L2 | Parameters & constraints: v_h, v_t ∈ C^d, M_r^{diag} = diag(v_r), v_r ∈ C^d.
3.1.1. Tensor/matrix factorization models

Here we introduce a series of Tensor Factorization (TF) models in detail and provide a summary table (Table 4) exhibiting the characteristics of these models. Recently, tensors and their decompositions have been widely used in data mining and machine learning problems [13]. In the KG field, large-scale tensor factorization has received more and more attention for KGC tasks.

Based on the fact that a KG can be represented as a tensor (shown in Fig. 4), KGC can be framed as a third-order binary tensor completion problem; tensors can also be regarded as a general device replacing common formalisms such as graphical models [50]. For KGC, the relational data can be represented as a {0, 1}-valued third-order tensor Y ∈ {0, 1}^{|E|×|R|×|E|}, where Y_{h,r,t} = 1 if the triple (h, r, t) is true, and the three modes stand for the subject mode, the predicate mode, and the object mode, respectively. TF algorithms aim to infer a predicted tensor X ∈ R^{|E|×|R|×|E|} that approximates Y in some sense. Validation/test queries (?, r, t) are generally answered by ordering candidate entities h′ by decreasing values of X_{h′,r,t}, while queries (h, r, ?) are answered by ordering entities t′ by decreasing values of X_{h,r,t′}. In that context, numerous works have considered link prediction as a low-rank tensor decomposition problem.
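The tensor view can be made concrete with a small numerical sketch (illustrative only, not any cited paper's implementation): triples are indexed into a binary tensor Y, and a query is answered by sorting one slice of a reconstructed tensor X. The dense "reconstruction" below is a noisy copy standing in for a learned low-rank approximation.

```python
import numpy as np

n_ent, n_rel = 5, 2
triples = [(0, 0, 1), (1, 0, 2), (3, 1, 4)]      # (head, relation, tail) indices

# Binary observation tensor Y in {0,1}^(|E| x |R| x |E|)
Y = np.zeros((n_ent, n_rel, n_ent))
for h, r, t in triples:
    Y[h, r, t] = 1.0

# Stand-in for the predicted tensor X that a TF model would produce
rng = np.random.default_rng(0)
X = Y + 0.1 * rng.standard_normal(Y.shape)

# Answering a tail query (h, r, ?): rank all entities t' by X[h, r, t']
h, r = 0, 0
ranking = np.argsort(-X[h, r])                   # best candidates first
print("candidates for (0, 0, ?):", ranking[:3])
```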
Table 4 (continued).
- N3 regularizer [50] | Highlights: CP + p-norm regularizer | Score function: s(h, r, t) = s_CP = Σ_{i=0}^{d−1} [v_r]_i · [v_h]_i · [v_t]_i | Loss: L_nll + Δ_N3 | Parameters & constraints: v_h, v_t, v_r ∈ R^d.
3.1.1.1. Tucker decomposition-based TF methods. The Tucker decomposition factorizes a three-way tensor χ ∈ R^{N_1×N_2×N_3} into a core tensor Z ∈ R^{M_1×M_2×M_3} multiplied by a factor matrix along each mode:

χ ≈ Z ×_1 A ×_2 B ×_3 C,

where ×_n denotes the tensor product along the nth mode. The factor matrices A, B, and C can be considered as the principal components in each mode if they are orthogonal. Typically, since M_1, M_2, M_3 are smaller than N_1, N_2, N_3 respectively, Z can be regarded as a compressed version of χ, whose elements express the interaction level between the various components. However, Kolda and Bader [56] indicated that the Tucker decomposition is not unique, because the core tensor Z can be transformed without affecting the fit if the inverse transformation is applied to A, B, and C.

TuckER [37] applies the Tucker decomposition to the binary tensor representation of a KG. It is a powerful linear model with fewer parameters that obtains consistently good results, because it enables multi-task learning across relations. By modeling the binary tensor representation of a KG according to the Tucker decomposition as in Fig. 5, TuckER defines the score function as s(h, r, t) = W ×_1 v_h ×_2 v_r ×_3 v_t (see Table 4).

3.1.1.2. Decomposition into directional components (DEDICOM)-based TF methods. In contrast to the Tucker decomposition, the rank-r DEDICOM decomposition [57] is capable of detecting correlations between multiple interconnected nodes, which can be captured by singly or jointly considering the attributes, relations, and classes of the related entities during a learning process. DEDICOM decomposes a three-way tensor χ slice-wise as χ_k ≈ A D_k R D_k A^T, where A holds the latent components and their variation in the third mode is described by the diagonal factors D_k [13].

Fig. 6. The three-way tensor model for relational data [13].

RESCAL [13] is an early three-way DEDICOM-based model for KGC, which interprets the inherent structure of dyadic relational data. It employs a three-way tensor χ (as shown in Fig. 6), where two modes are identically formed by the concatenated entity vectors of the domain and the third mode holds the relation matrices of the domain. The score of a fact (h, r, t) is defined by a bilinear function:

s(h, r, t) = v_h^T M_r v_t,   (1)

where v_h, v_t ∈ R^d are entity embeddings, and M_r ∈ R^{d×d} is an asymmetric matrix associated with the relation that models the interactions between latent factors.

LFM [38] is a bilinear TF model extending RESCAL, designed to cope with the growth of relational data and thus model large multi-relational datasets. Similar to RESCAL, LFM embeds entities in d-dimensional vectors and encodes each relation j into a matrix M_{r_j} acting as a bilinear operator among the entities, where 1 ≤ j ≤ N_r and M_{r_j} ∈ R^{d×d}. For efficiently modeling a large relational factor (h, r, t), LFM first redefines the previous linear score terms in the following form to take account of the different interaction orders, including unigram, bigram, and trigram orders, between h, t, and r:

s(h, r, t) ≜ y^T M_r y′ + v_h^T M_r z + z′^T M_r v_t + v_h^T M_r v_t   (2)

where the parameters y, y′, z, z′ ∈ R^d participate in the y^T M_r y′ and v_h^T M_r z + z′^T M_r v_t terms; together with v_h^T M_r v_t, these three groups of terms represent the uni-, bi-, and trigram orders of interaction between h, t, and r. The other improvement over RESCAL is decomposing the relation matrix M_r over a set of p rank-one matrices Θ_j (1 ≤ j ≤ p):

M_r = Σ_{j=1}^{p} α_r^j Θ_j   (3)

where Θ_j = u_j w_j^T for u_j, w_j ∈ R^d and α_r ∈ R^p. The matrices Θ_j, constrained by the outer-product form, efficiently decrease the number of overall parameters compared with the general relation-matrix parameterization of RESCAL, which greatly speeds up the computations relying on traditional linear algebra. LFM optimizes the terms appearing in formulas (2) and (3) by minimizing the negative log-likelihood over a specific constraint set.

Tatec [39] cooperates with both 2-way and 3-way interaction models to capture different data patterns in respective embedding spaces, and obtains a performance outstripping the best of either constituent. Different from its closest relative, LFM, Tatec combines the 3-way model and a constrained 2-way model but pre-trains them separately. Tatec learns distinct embeddings and relation parameters for the 2-way and the 3-way interaction terms so that neither term is reduced to the other. Its score function is s(h, r, t) = s_1(h, r, t) + s_2(h, r, t) (see Table 4), where v_{hi}, v_{ti} are embeddings of the head and tail entities in R^{d_i} space (i = 1, 2), v_{r1}, v_{r2} are vectors in R^{d_1}, M_r ∈ R^{d_2×d_2} is a mapping matrix, and D is a diagonal matrix that is independent of the input triple. Depending on whether the parameters of the 2-way and 3-way score terms are jointly updated (or fine-tuned) in a second phase, Tatec proposes two term-combination strategies to effectively combine the bigram and trigram scores: fine tuning (Tatec-ft) and linear combination (Tatec-lc). The former simply adds the s_1 and s_2 terms and fine-tunes the overall parameters of s, while the latter combines the two in a linear way. Besides, Tatec applies hard regularization or soft regularization for the Tatec-ft optimization problem.

ANALOGY [40] is an extended version of RESCAL that explicitly models analogical properties of both entity and relation embeddings. It applies the bilinear score function used in RESCAL (shown in formula (1)) but further stipulates that the relation mapping matrices must be normal as well as mutually commutative:

normality: M_r M_r^T = M_r^T M_r, ∀r ∈ R
commutativity: M_r M_{r′} = M_{r′} M_r, ∀r, r′ ∈ R

The relation matrices can be simultaneously block-diagonalized into a set of sparse almost-diagonal matrices, each of which has O(d) free parameters. Besides, ANALOGY carries out training by formulating a differentiable learning objective, which allows it to exhibit favorable theoretical power and computational scalability. Relevant evidence has shown that multiple TF methods, such as DistMult [42], HolE [58], and ComplEx [43] (to be mentioned later), can be regarded as special cases of ANALOGY in a principled manner.

REST [41] has a fast response speed and good adaptability to evolving data, and yet obtains comparable or better performance than other previous TF approaches. Based on the TF model, REST uses a Random-Walk (RW)-based semantic graph sampling algorithm (SGS) and a predicate sparsification operator to construct ensemble components: it samples a large KG tensor in its graph representation to build diverse, smaller subgraph tensors (the ensemble architecture is shown in Fig. 7), and then uses them in conjunction for the focused link prediction (FLP) task. Experimental results show that FLP and SGS help to reduce the search space and noise. In addition, predicate sparsification can improve the prediction accuracy. REST can deliver results on demand, which makes it more suitable for the dynamic and evolving KGC field.

3.1.1.3. CANDECOM/PARAFAC (CP)-based TF methods. The most well-known canonical tensor decomposition method relevant to the KGC field might be CANDECOM/PARAFAC (CP) [59], in which a tensor χ ∈ R^{N_1×N_2×N_3} is represented as a sum of R rank-one tensors x_r^{(1)} ⊗ x_r^{(2)} ⊗ x_r^{(3)}:

χ = Σ_{r=1}^{R} x_r^{(1)} ⊗ x_r^{(2)} ⊗ x_r^{(3)}
The corresponding CP scoring function for a triple (h, r, t) is s_CP(h, r, t) = Σ_{i=1}^{d} [v_h]_i · [v_r]_i · [v_t]_i (cf. Table 4).
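As a numerical illustration of the score functions surveyed in this subsection, the following sketch (toy values, not taken from any cited paper) evaluates the RESCAL, CP/DistMult, ComplEx, and TuckER scores for one triple.

```python
import numpy as np

d, d_e, d_r = 4, 4, 3
rng = np.random.default_rng(0)
v_h, v_t, v_r = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)

# RESCAL: bilinear score with a full relation matrix M_r
M_r = rng.standard_normal((d, d))
s_rescal = v_h @ M_r @ v_t

# CP / DistMult: trilinear product with (diagonal) relation parameters
s_cp = np.sum(v_h * v_r * v_t)

# ComplEx: real part of the trilinear product with complex embeddings
c_h, c_r, c_t = (rng.standard_normal(d) + 1j * rng.standard_normal(d) for _ in range(3))
s_complex = np.real(np.sum(c_r * c_h * np.conj(c_t)))

# TuckER: core tensor W contracted with head, relation and tail vectors
W = rng.standard_normal((d_e, d_r, d_e))
e_h, e_t, e_r = rng.standard_normal(d_e), rng.standard_normal(d_e), rng.standard_normal(d_r)
s_tucker = np.einsum('ijk,i,j,k->', W, e_h, e_r, e_t)

print(s_rescal, s_cp, s_complex, s_tucker)
```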
Table 5
Experimental results of TF models on WN18 and FB15K. We use bold and italics to mark the scores ranking first and second under the same metric, respectively.

Model | WN18: MRR, Hits@1, Hits@3, Hits@10, MR | FB15K: MRR, Hits@1, Hits@3, Hits@10, MR
RESCAL [13] 0.89 0.842 0.904 0.928 – 0.354 0.235 0.409 0.587 –
DistMult [42] 0.83 – – 0.942 – 0.35 – – 0.577 –
Single DistMult [47] 0.797 – – 0.946 655 0.798 – – 0.893 42.2
Ensemble DistMult [47] 0.79 – – 0.95 457 0.837 – – 0.904 35.9
ComplEx [43] 0.941 0.936 0.945 0.947 – 0.692 0.599 0.759 0.84 –
ComplEx w/std L1 [48] 0.943 0.94 0.945 0.948 – 0.711 0.618 0.783 0.856 –
ComplEx w/mul L1 [48] 0.943 0.94 0.946 0.949 – 0.733 0.643 0.803 0.868 –
ComplEx-NNEc [49] 0.941 0.937 0.944 0.948 – 0.727 0.659 0.772 0.845 –
ComplEx-NNE+AERc [49] 0.943 0.94 0.945 0.948 – 0.803 0.761 0.831 0.874 –
ANALOGY [48] 0.942 0.939 0.944 0.947 – 0.725 0.646 0.785 0.854 –
RESCAL + TransE [31] 0.873 – – 0.948 510 0.511 – – 0.797 61
RESCAL + HolE [31] 0.94 – – 0.944 743 0.575 – – 0.791 165
HolE + TransE [31] 0.938 – – 0.949 507 0.61 – – 0.846 67
RESCAL + HolE + TransE [31] 0.94 – – 0.95 507 0.628 – – 0.851 52
SimplE [44] 0.942 0.939 0.944 0.947 – 0.727 0.66 0.773 0.838 –
ComplEx-N3-Sa [50] 0.95 – – 0.96 – 0.8 – – 0.89 –
CP [51] 0.942 0.939 0.945 0.947 – 0.72 0.659 0.768 0.829 –
CP-FRO-Rb [50] 0.95 – – 0.95 – 0.86 – – 0.91 –
CP-N3-Rb [50] 0.95 – – 0.96 – 0.86 – – 0.91 –
ComplEx-FRO-Rb [50] 0.95 – – 0.96 – 0.86 – – 0.91 –
ComplEx-N3-Rb [50] 0.95 – – 0.96 – 0.86 – – 0.91 –
B-DistMult [51] 0.841 0.761 0.915 0.944 – 0.672 0.558 0.76 0.854 –
B-CP [51] 0.945 0.941 0.948 0.956 – 0.733 0.66 0.793 0.87 –
QuatE [52] 0.949 0.941 0.954 0.96 388 0.77 0.7 0.821 0.878 41
QuatE-N3-R [52] 0.95 0.944 0.954 0.962 – 0.833 0.8 0.859 0.9 –
QuatE+TYPEc [52] 0.95 0.945 0.954 0.959 162 0.782 0.711 0.835 0.9 17
TuckER [37] 0.953 0.949 0.955 0.958 – 0.795 0.741 0.833 0.892 –
a 'S' means Standard learning.
b 'R' denotes Reciprocal learning.
c 'NNE', 'AER' and 'TYPE' denote the non-negativity constraints, the approximate entailment constraints [49], and the type constraints [52], respectively.
selects a larger rank scope for a more extensive search over optimization/regularization parameters, which is also a reason for its good performance. Approaches that apply the nuclear 3-norm regularizer still show extraordinary results on FB15K, and most of the improvements there are more statistically significant than those on WN18.

(2) Constraints on entities and relations: The results of 'ComplEx-NNE+AER' [49] demonstrate that imposing non-negativity and approximate entailment constraints on entities and relations, respectively, indeed improves KG embedding. In Table 5, 'ComplEx-NNE' and 'ComplEx-NNE+AER' perform better than (or as well as) ComplEx on WN18. An interesting observation is that by introducing these simple constraints, 'ComplEx-NNE+AER' can beat strong baselines, including the best-performing basic models like ANALOGY and previous extensions of ComplEx, while such axioms can be derived directly from the approximate entailments in [49]. Exerting proper constraints on the original linear TF models is thus very helpful for KGC; on WN18, the constraint-based 'ComplEx-NNE+AER' also outperforms ComplEx and other traditional TF models.

(3) Different dimension space modeling: In addition, explorations of new tensor decomposition modes in different dimensional spaces have also achieved inspiring success. From Table 5 we can observe that on WN18, the quaternion-valued method QuatE performs competitively compared to the existing state-of-the-art models across all metrics and clearly outperforms the representative complex-valued basic model ComplEx; this is because quaternion rotation has advantages over rotation in the complex plane in modeling complex relations. Besides, the N3 regularization and reciprocal learning in QuatE, as well as its type constraints, also play an important role in QuatE's success. Another eye-catching method, TuckER, takes account of the binary tensor representation of KGs and outperforms almost all linear TF models along with their extensions on all metrics on WN18. TuckER consistently obtains better results than the lightweight models ComplEx and SimplE, which are famous for their simplicity and small number of parameters; this is because TuckER allows knowledge sharing between relations through the core tensor, so that it supports multi-task learning. In comparison, 'ComplEx-N3' [50], which also benefits from multi-task learning, forces parameter sharing between relations through rank regularization of the embedding matrices to encourage a low-rank factorization; it uses the highly non-standard setting d_e = d_r = 2000, generating a large number of parameters compared with TuckER and resulting in slightly lower grades than TuckER. Additionally, both QuatE and TuckER achieve remarkable results on FB15K; in particular, QuatE on Hits@1 outperforms the state-of-the-art models, while the second-best results scatter between TuckER and 'ComplEx-NNE+AER'. Unlike the constraint-based methods that apply prior beliefs to shrink the solution space, QuatE achieves high grades by effectively capturing the symmetry, antisymmetry, and inversion relation patterns, which take a large portion of both WN18 and FB15K. On FB15K, TuckER obtains lackluster performance on the MRR and Hits@10 metrics but excels on the toughest Hits@1 metric.

(4) Considering hyper-parameter settings: It is notable that on FB15K, the Ensemble DistMult also attains high results on both MRR and Hits@10; this is because it improves DistMult only with proper hyper-parameter settings. This work helps resolve the doubt of whether a result was achieved by a better model/algorithm or just by a more extensive hyper-parameter search. On the other hand, the good results of DistMult reported in Ensemble DistMult are also due to the use of a large negative sampling size (i.e., 1000, 2000).

b. Further Performance Verification:
We have analyzed the effects of many factors on performance, especially the effectiveness of constraint or regularization techniques. To further evaluate their efficacy, we select the experimental results evaluated on WN18RR and FB15K-237 for illustration.
Table 6
Summarization and comparison of recent popular Neural Network models for KGC.

Traditional neural network models:
- NTN [14] | Technique: bilinear tensor layer | Score function: g(h, r, t) = u^T f(p_1 + p_2 + b_r), with p_1 = v_h^T W_r^{[1:k]} v_t and p_2 = V_r [v_h, v_t]^T | Loss: max L_marg | Notation: W_r^{[1:k]} ∈ R^{d×d×k}, V_r ∈ R^{k×2d}, f = tanh() | Datasets: WordNet, Freebase.
- MLP [9] | Technique: improves NTN; standard multi-layer perceptron | Score function: s(h, r, t) = w^T f(p_1 + p_2 + p_3), with p_1 = M_1 v_h, p_2 = M_2 v_r, p_3 = M_3 v_t | Loss: – | Notation: v_h, v_r, v_t ∈ R^d, M_i ∈ R^{d×d}, w ∈ R^d, f = tanh() | Datasets: KV.
- NAM [64] | Technique: multi-layer nonlinear activations; probabilistic reasoning | Score function: s(h, r, t) = g(v_t^T u^{L}), with u^{l} = f(W^{l} u^{l−1} + b^{l}) and u^{0} = [v_h, v_r] | Loss: L_ll | Notation: v_h, v_t, v_r ∈ R^d, g = sigmoid(), f = ReLU() | Datasets: WN11, FB13.
- SENN [65] | Technique: embedding-shared fully connected neural networks; adaptively weighted loss mechanism | Score function: s(h, t) = v_r A_R^T, s(r, t) = v_h A_E^T, s(h, r) = v_t A_E^T, with v_r = f(f(...f([h; t] W_{r,1} + b_{r,1})...)) W_{r,n} + b_{r,n} (and analogously for v_h and v_t) | Loss: joint adaptively weighted loss | Notation: v_h, v_t, v_r ∈ R^d, A_E ∈ R^{|E|×d}, A_R ∈ R^{|R|×d}, f = ReLU() | Datasets: WN18, FB15K.
- ParamE [66] | Technique: MLP; CNN; gate structure; embeds relations as NN parameters | Score function: s(h, r, t) = ((f_nn(v_h; v_r)) W + b) v_t, with v_r = Param_{f_nn} | Loss: L_BCE | Notation: v_h, v_t, v_r ∈ R^d, W ∈ R^{d×n}, b ∈ R^d, g = sigmoid(), f = ReLU() | Datasets: FB15k-237, WN18RR.

CNN-based KGC models:
- ConvE [33] | Technique: multi-layer 2D CNN; 1-N scoring | Score function: s(h, r, t) = f(vec(f(concat(v̂_h, v̂_r) ∗ Ω)) W) · v_t | Loss: L_BCE | Notation: v_h, v_t ∈ R^d, v̂_h, v̂_r ∈ R^{d_w×d_h}, v_r ∈ R^{d′}, d = d_w d_h, f = ReLU(), Ω: filter sets | Datasets: WN18, FB15k, YAGO3-10, Countries, FB15k-237.
- InteractE [67] | Technique: feature permutation; checkered reshaping; circular convolution | Score function: s(h, r, t) = g(vec(f(φ(P_k) ◦ w)) W) v_t, with P_i = [(v_h^1, v_r^1); ...; (v_h^i, v_r^i)] | Loss: L_BCE | Notation: v_h, v_t, v_r ∈ R^d, d = d_w d_h, f = ReLU(), g = sigmoid(), w: a filter | Datasets: FB15K-237, WN18RR, YAGO3-10.
- ConvKB [68] | Technique: 1D CNN; transitional characteristic; L2 regularization | Score function: s(h, r, t) = concat(g([v_h, v_r, v_t] ∗ Ω)) · W | Loss: L_nll | Notation: v_h, v_t, v_r ∈ R^d, g = ReLU(), Ω: filter sets | Datasets: WN18RR, FB15k-237.
- CapsE [69] | Technique: ConvKB; capsule networks | Score function: s(h, r, t) = ∥cap(g([v_h, v_r, v_t] ∗ Ω))∥ | Loss: L_nll | Notation: v_h, v_r, v_t ∈ R^d, g = ReLU(), Ω: filter sets, cap(): capsule network | Datasets: WN18RR, FB15k-237.

GCN-based KGC models:
- R-GCN [70] | Technique: basis decomposition; block-diagonal decomposition; end-to-end framework (encoder: R-GCN, decoder: DistMult) | Score function: s(h, r, t) = v_h^T W_r v_t | Loss: L_BCE | Notation: v_h, v_t ∈ R^d, W_r ∈ R^{d×d} | Datasets: WN18RR, FB15k, FB15k-237.
- SACN [71] | Technique: end-to-end framework (encoder: WGCN, decoder: Conv-TransE) | Score function: s(h, r, t) = f(vec(M(v_h, v_r)) W) v_t | Loss: – | Notation: f = ReLU(), W ∈ R^{Cd×d}, M(v_h, v_r) ∈ R^{C×d}, C: number of kernels | Datasets: FB15k-237, WN18RR, FB15k-237-Attr.
- COMPGCN [72] | Technique: entity-relation composition operators; end-to-end framework (encoder: COMPGCN, decoder: ConvE, DistMult, etc.) | Score function: s_ConvE, s_DistMult, etc. | Loss: – | Notation: – | Datasets: FB15k-237, WN18RR.
recognizing textual entailment, and responds especially well in commonsense reasoning.

Shared Embedding based Neural Network (SENN) [65] explicitly differentiates the prediction tasks of head entities, relations, and tail entities by using three respective substructures with fully-connected neural networks in an embedding-sharing manner. The prediction-specific scores obtained from the substructures are then employed to estimate the plausibility of predictions. An adaptively weighted loss mechanism enables SENN to be more efficient in dealing with diverse prediction tasks and various mapping styles during the training process.

ParamE [66] is an expressive and translational KGC model which regards neural network parameters as relation embeddings, while the head entity embeddings and tail entity embeddings are regarded as the input and output of this neural network, respectively. To confirm whether ParamE is a general framework for different NN architectures, the paper designs three different NN architectures to implement ParamE: multi-layer perceptrons (MLP), convolution layers, and gate structure layers, called ParamE-MLP,
Table 6 (continued).

GAN-based KGC models:
- KBGAN [28] | Technique: discriminator + generator; negative sampling; reinforcement learning | Score function: s_trans | Loss: L_marg | Datasets: FB15k-237, WN18, WN18RR.
- IGAN [31] | Technique: discriminator + generator; negative sampling; reinforcement learning; non-zero loss | Score function: s_trans or s_sem | Loss: L_marg | Datasets: FB15K, FB13, WN11, WN18.
- KSGAN [29] | Technique: discriminator + generator + selector; negative sampling; reinforcement learning; non-zero loss | Score function: s_sem | Loss: L_marg | Datasets: FB15k-237, WN18, WN18RR.

Notes: L_ll (L_nll), L_marg and L_BCE are the (negative) log-likelihood loss, the margin-based ranking loss and the binary cross-entropy loss, respectively. '∗' denotes a convolution operator; '◦' denotes depth-wise circular convolution; 's_trans' and 's_sem' are the score functions of translation models and semantic matching models, respectively.
ParamE-CNN, and ParamE-Gate. Significantly, ParamE embeds the entity and relation representations in feature space and parameter space respectively, which makes entities and relations be mapped into two different spaces as expected.

3.1.2.2. Convolutional Neural Network (CNN)-based KGC models. We summarize some CNN-based KGC methods and draw a related figure (Fig. 10) exhibiting their whole architectures, from which the learning procedure of these models can be clearly seen.

ConvE [33] describes a multi-layer 2D convolutional network model for the LP task, and is the first attempt to use 2D convolutions over graph embeddings to explore more valuable feature interactions. ConvE defines its score function by a convolution over 2D-shaped embeddings as:

s(h, r, t) = f(vec(f([v̄_h; v̄_r] ∗ ω)) W) v_t

where v_r ∈ R^k is the relation parameter, and v̄_h and v̄_r represent the 2D reshapings of v_h and v_r respectively, with v̄_h, v̄_r ∈ R^{k_w×k_h} when v_h, v_r ∈ R^k and k = k_w k_h, where k_w and k_h denote the width and height of the reshaped 2D matrix. vec() is the vectorization operation, while f() indicates the basic nonlinear transformation function, rectified linear units, for faster training [76]. ConvE has much fewer parameters but is significantly efficient when modeling large-scale KGs with high-degree nodes. This work also points out the test set leakage issue of the WN18 and FB15k datasets, performing a comparative experiment on their robust variants WN18RR and FB15K-237.

InteractE [67] further advances ConvE by increasing the captured interactions to heighten LP performance. InteractE chooses a novel input style in a multiple-permutation manner and replaces the simple feature reshaping of ConvE with checkered reshaping. Additionally, its special circular convolution structure is performed in a depth-wise manner.

ConvKB [68] was proposed after ConvE; the main difference between ConvKB and ConvE is that ConvKB uses 1D convolution, expecting to extract global relations over the same-dimensional entries of an input triple matrix, which indicates that ConvKB pays attention to the transitional characteristics of triples. According to the evaluation on two benchmark datasets, WN18RR and FB15k-237, ConvKB achieves better grades compared with ConvE and some other past models, which may be due to the efficient CNN structure as well as the design for extracting the global relation information, so that ConvKB does not ignore the transitional characteristics of triples in KGs.
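The following is a minimal, illustrative PyTorch sketch of the 2D-reshape-and-convolve scoring scheme used by ConvE. The layer sizes, filter count, kernel size, and the omission of dropout and batch normalization are simplifications of this sketch rather than the original model.

```python
import torch
import torch.nn as nn

class ConvEStyleScorer(nn.Module):
    """Reshape head/relation embeddings to 2D, stack them, apply a 2D
    convolution, project back to the embedding size, and score against all
    tail embeddings (1-N scoring)."""
    def __init__(self, n_ent, n_rel, k_w=10, k_h=20, n_filters=32):
        super().__init__()
        d = k_w * k_h
        self.k_w, self.k_h = k_w, k_h
        self.ent = nn.Embedding(n_ent, d)
        self.rel = nn.Embedding(n_rel, d)
        self.conv = nn.Conv2d(1, n_filters, kernel_size=3)
        self.fc = nn.Linear(n_filters * (2 * k_w - 2) * (k_h - 2), d)

    def forward(self, h_idx, r_idx):
        vh = self.ent(h_idx).view(-1, 1, self.k_w, self.k_h)   # 2D reshaping
        vr = self.rel(r_idx).view(-1, 1, self.k_w, self.k_h)
        x = torch.cat([vh, vr], dim=2)                          # stacked input image
        x = torch.relu(self.conv(x))
        x = torch.relu(self.fc(x.flatten(1)))
        return x @ self.ent.weight.t()                          # scores over all tails

# Example: scores for two (head, relation) queries over a toy vocabulary.
model = ConvEStyleScorer(n_ent=50, n_rel=5)
scores = model(torch.tensor([0, 3]), torch.tensor([1, 2]))
print(scores.shape)   # torch.Size([2, 50])
```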
CapsE. After ConvKB, Nguyen et al. [69] present CapsE to model triples by employing the capsule network [77], a network originally intended for capturing entities in images; it is the first attempt to apply a capsule network to KGC. The general framework of CapsE is shown in Fig. 10(d): after being fed to a convolution layer with multiple filter sets Ω, as ConvKB does, the 3-column triple matrix is transformed into different feature maps, and these feature maps are then reconstructed by two capsule layers. A routing algorithm extended from Sabour et al. [77] guides the routing process between these two capsule layers. In the end, a continuous vector is produced whose length is used to compute the score of the triple:

s(h, r, t) = ∥capsnet(g([v_h, v_r, v_t] ∗ Ω))∥

where capsnet and ∗ denote the capsule network operator and the convolution operation, respectively. Experimental results confirm that the CapsE model performs better than ConvKB [68] on WN18RR and FB15k-237.

3.1.2.3. Graph Convolution Network (GCN)-based KGC models. The Graph Convolution Network (GCN) [78] was introduced as a generalization of Convolutional Neural Networks (CNNs),¹ and is a popular neural network architecture defined on graph structures [70,80,82]. Recently, many researchers have employed GCNs to predict missing facts in KGs.

R-GCN [70] is presented as an extension of GCNs that operates on local graph neighborhoods to accomplish KGC tasks. R-GCN uses relation-specific transformations, different from regular GCNs, on the encoder side. For the LP task, the DistMult model is chosen as the decoder to compute an edge's score. To avoid over-fitting on sparse relations and a massive growth of model parameters, this work utilizes basis decomposition and block-diagonal decomposition to regularize the weights of the R-GCN layers. R-GCN can act as a competitive, end-to-end trainable graph-based encoder (just as SACN [71] shows); i.e., in the LP task, the R-GCN model with DistMult factorization as the decoding component outperformed direct optimization of the factorization model and achieved competitive results on standard LP benchmarks.

Structure-Aware Convolutional Network (SACN) [71] is an end-to-end model whose encoder uses a stack of multiple WGCN (Weighted GCN) layers to learn information from both the graph structure and the graph nodes' attributes. The WGCN framework addresses the over-parameterization shortcoming of GCNs by assigning a learnable relation-specific scalar weight to each relation and multiplying an incoming ''message'' by this weight during GCN aggregation. The decoder, Conv-TransE, is modified from ConvE but abolishes the reshaping step of ConvE while keeping the translational property among triples. In summary, the SACN framework efficiently combines the advantages of ConvE and GCNs, and thus obtains better performance than the original ConvE model on the benchmark datasets FB15k-237 and WN18RR.

COMPGCN [72]. Although R-GCN and WGCN show performance gains on the KGC task, they are limited to embedding only the entities of the graph. On this basis, COMPGCN systematically leverages entity-relation composition operations from KGE techniques to jointly embed entities and relations in a graph. Firstly, COMPGCN alleviates the over-parameterization problem by performing a KGE composition φ(u, r) of a neighboring node u with respect to its relation r to substitute the original neighbor parameter v_u in GCNs; therefore, COMPGCN is relation-aware. Additionally, to ensure that COMPGCN scales with an increasing number of relations, it shares relation embeddings across layers and uses basis decomposition based on the basis formulations proposed in R-GCN. Different from R-GCN, which defines a separate set of basis matrices for each GCN layer, COMPGCN defines basis vectors only for the first GCN layer, while the later layers share the relations through relation embedding transformations performed by a learnable transformation matrix. This makes COMPGCN more parameter-efficient than R-GCN.

Recently, more and more novel and effective GCN methods have been proposed to conduct graph analytical tasks. To efficiently exploit the structural properties of relational graphs, some recent works try to extend multi-layer GCNs to specific tasks for obtaining proper graph representations. For example, Bi-CLKT [83] and JKT [84], which are both knowledge tracing methods [85], apply a two-layer GCN structure to encode node-level and global-level representations for the relational subgraphs exercise-to-exercise (E2E) and concept-to-concept (C2C), respectively. The use of a two-layer GCN can effectively learn the original structural information from multidimensional relationship subgraphs. Besides, ie-HGCN [86] tries to learn interpretable and efficient task-specific object representations by using multiple layers of heterogeneous graph convolution on a Heterogeneous Information Network (HIN) [87]. Based on these works, a possible direction for future research is to explore multi-layer GCNs to efficiently capture different levels of structural information of KGs for the KGC task.

3.1.2.4. Generative adversarial network (GAN)-based KGC models. The generative adversarial network (GAN) [88] is one of the most promising methods for unsupervised learning on complex distributions in recent years; it was originally proposed for generating samples in a continuous space, such as images. A GAN usually consists of at least two modules: a generative module and a discriminative module. The former accepts a noise input and outputs an image, while the latter is a classifier that classifies images as ''true'' (from the ground-truth set) or ''fake'' (generated by the generator); the two parts train and learn together in a confrontational way. However, the original version of GANs cannot be used to generate discrete samples like natural language sentences or knowledge graph triples, since gradients propagating back to the generator are blocked by the discrete sampling step [28], until SEQGAN [89] first gave a successful solution to this problem by using reinforcement learning: it trains the generator using policy gradients and other tricks. Likewise, many KGC works have incorporated the GAN framework into knowledge representation learning. Table 7 shows general information about the GAN-based negative sampling methods, and Fig. 11 reveals the frame structure of GAN-based models.

KBGAN [28] aims to employ adversarial learning to generate high-quality negative training samples and to replace the formerly used uniform sampling, in order to improve Knowledge Graph Embedding (KGE). As Fig. 11(a) shows, KBGAN takes KGE models that are probability-based and have a log-loss function as the generator to supply better-quality negative examples, while the discriminator uses distance-based, margin-loss KGE models to generate the final KG embeddings. More specifically, it expects the generator to generate negative triples (h′, r, t′) that obey the probability distribution p_G(h′, r, t′|h, r, t).

¹ Whereas CNNs require regularly structured data, such as images or sequences, GCNs allow for irregular graph-structured data [79]. GCNs can learn to extract features from the given node (entity) representations and then combine these features to construct highly expressive entity vectors, which can further be used in a wide variety of graph-related tasks, such as graph classification [80] and generation [81].
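The adversarial negative sampling loop can be sketched as follows. This is an illustrative simplification under several assumptions, not KBGAN's actual implementation: both generator and discriminator are TransE-style here for brevity, the candidate pool is sampled uniformly, and the policy-gradient reward is simply the negative's distance under the discriminator.

```python
import torch
import torch.nn.functional as F

n_ent, n_rel, dim, margin = 100, 10, 32, 1.0
gen_ent = torch.nn.Embedding(n_ent, dim)   # generator's own embeddings
gen_rel = torch.nn.Embedding(n_rel, dim)
dis_ent = torch.nn.Embedding(n_ent, dim)   # discriminator's embeddings
dis_rel = torch.nn.Embedding(n_rel, dim)
opt_g = torch.optim.Adam(list(gen_ent.parameters()) + list(gen_rel.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(list(dis_ent.parameters()) + list(dis_rel.parameters()), lr=1e-3)

def dis_score(h, r, t):
    # TransE-style distance: lower means more plausible
    return (dis_ent(h) + dis_rel(r) - dis_ent(t)).norm(p=1, dim=-1)

def train_step(h, r, t, n_cand=20):
    # 1) Generator proposes a corrupted tail from a random candidate pool
    #    via a softmax over its own plausibility scores.
    cand = torch.randint(0, n_ent, (n_cand,))
    logits = -(gen_ent(h) + gen_rel(r) - gen_ent(cand)).norm(p=1, dim=-1)
    probs = F.softmax(logits, dim=0)
    idx = torch.multinomial(probs, 1)
    t_neg = cand[idx].squeeze(0)

    # 2) Discriminator update: margin-based ranking loss on (positive, negative).
    loss_d = F.relu(margin + dis_score(h, r, t) - dis_score(h, r, t_neg)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 3) Generator update: REINFORCE with the (detached) implausibility of the
    #    sampled negative under the discriminator as the reward signal.
    reward = -dis_score(h, r, t_neg).detach()
    loss_g = -(reward * torch.log(probs[idx] + 1e-9)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Example with a single toy triple:
train_step(torch.tensor(0), torch.tensor(1), torch.tensor(2))
```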
Table 7
Characteristics of several GAN-based negative sampling technologies for KGC.

- KBGAN [28] | Modules: generator, discriminator | Generator: semantic matching models with softmax probabilities | Discriminator: translational distance models | Generator reward function: R_G = Σ_{(h,r,t)∈T} E_{(h′,r,t′)∼p_G}[R].
- IGAN [31] | Modules: generator, discriminator | Generator: neural network models with softmax probabilities | Discriminator: KGE models | Generator reward function: J(θ) = E_{e∼p(e|·;θ)}[R], with negatives (h*, r, t*) ∈ Neg(h, r, t).
- KSGAN [29] | Modules: generator, discriminator, knowledge selector | Generator: translational distance models | Discriminator: semantic matching models | Generator reward function: R_G = Σ_{(h,r,t)∈T} Σ_{(h′,r,t′)∈T′_s} E[R].
Table 8
Published results of Neural Network-based KGC methods. Best results are in bold.

Model | WN18RR: MR, MRR, Hits@1, Hits@3, Hits@10 | FB15K-237: MR, MRR, Hits@1, Hits@3, Hits@10
ConvE [33]b 4464 0.456 0.419 0.470 0.531 245 0.312 0.225 0.341 0.497
ConvKB [68]a 3433 0.249 – – 0.524 309 0.243 – – 0.421
CapsE [69]a 718 0.415 – – 0.559 403 0.150 – – 0.356
InteractE [67] 5202 0.463 0.430 – 0.528 172 0.354 0.263 – 0.535
R-GCN [70]b 6700 0.123 0.080 0.137 0.207 600 0.164 0.100 0.181 0.300
SACN [71] – 0.470 0.430 0.480 0.540 – 0.350 0.260 0.390 0.540
Conv-TransE [71] – 0.460 0.430 0.470 0.520 – 0.330 0.240 0.370 0.510
SACN with FB15k-237-Attr [71] – – – – – – 0.360 0.270 0.400 0.550
COMPGCN [72] 3533 0.479 0.443 0.494 0.546 197 0.355 0.264 0.390 0.535
ParamE-MLP [66] – 0.407 0.384 0.429 0.445 – 0.314 0.240 0.339 0.459
ParamE-CNN [66] – 0.461 0.434 0.472 0.513 – 0.393 0.304 0.426 0.576
ParamE-Gate [66] – 0.489 0.462 0.506 0.538 – 0.399 0.310 0.438 0.573
KBGAN [28] – 0.215 – – 0.469 – 0.277 – – 0.458
KSGAN [29] – 0.220 – – 0.479 – 0.280 – – 0.465
a Resulting numbers are re-evaluated by [90].
b Resulting numbers are reported by [91]; the others are taken from the original papers.
3.1.2.5. Performance analysis of neural network-based KGC models. We report the published results of neural network-based KGC approaches in Table 8 and make a simple comparison between them. From Table 8 we have the following findings:
1. Among the first four CNN-based KGC models, CapsE performs well on WN18RR because (1) in CapsE, the length and orientation of each capsule in the first layer help to model the important entries in the corresponding dimension, so CapsE is good at handling much sparser datasets such as WN18RR; and (2) CapsE uses pre-trained GloVe [92] word embeddings for initialization, i.e., it uses additional information.
2. R-GCN, SACN and its variants, and COMPGCN are all extensions of GCNs. Both SACN and COMPGCN use a weighted GCN to aggregate neighbor information with learnable weights, so they achieve consistently strong results on all datasets. Besides, ''SACN with FB15k-237-Attr'' uses the additional attribute information in the FB15k-237 dataset, which further improves its results on FB15k-237.
3. We observe that ''ParamE-Gate'' basically outperforms all the other neural network models, which is clearly reflected in the MRR, Hits@1, and Hits@3 metrics on both datasets. Note that ConvE and ParamE-CNN have similar network architectures, but ParamE-CNN achieves a substantial improvement over ConvE. ParamE-CNN takes its own parameters as relation embeddings, which captures the intrinsic property of relations and is more reasonable [66]. The comparison among ''ParamE-MLP'', ''ParamE-CNN'' and ''ParamE-Gate'' shows that MLP has a weaker modeling ability than convolution layers and the gate structure. Moreover, although convolution layers are good at extracting features, ''ParamE-CNN'' performs worse than ''ParamE-Gate'' because the gate structure can selectively let useful information through. In addition, although the differences between the FB15k-237 and WN18RR datasets cause some models to perform unevenly across the two datasets, ParamE-Gate works well on both.

3.1.2.6. Discussion on neural network models. Also known as non-linear models, the neural network KGC models rely on a neural network structure (together with a non-linear activation function such as sigmoid, tanh, or the Rectified Linear Unit (ReLU); see Table 6) to learn deep potential features.
Much of the KGE literature uses neural networks to represent KGs in a low-dimensional continuous space [11,14,15,64]. Neural networks can effectively extract the hidden latent features needed for knowledge reasoning, with strong accuracy, high reasoning scalability, and efficiency. However, neural network KGC models rely on a large amount of training data, i.e., they are data-driven, so they usually do not perform well on sparse KG data because of their heavy dependence on data. Moreover, these models have other shortcomings, such as low interpretability and too many parameters.
With the diversification of KGC research, more additional information is being used in the completion work. It should be noted that several models we previously discussed already make use of some additional information besides structural information. For example, the typical neural network KGC model SACN [71] applies a weighted graph convolutional network (WGCN) as its encoder, which utilizes node attributes and relation types.
The widely known CNN-based KGC models owe their effective performance to the strong expressiveness of neural networks. Typically, ConvE and ConvKB are applied as the decoder in many KGC methods (such as [72,91]). Likewise, various other neural network families have been widely applied together with different additional information for KGC. Take the recurrent neural network (RNN) as an example: because of its superior ability to learn sequence features, the RNN is often used in relational path-based KGC methods and is also exploited to process long text information (e.g., entity description text) for KGC. Similarly, a CNN can serve as a feature extractor for textual feature modeling in the KGC procedure in place of an RNN structure (e.g., [93–95]). Zia et al. [96] is another example, which combines a GAN structure with path information and will be introduced in detail in the subsequent additional information-based KGC methods.

3.2. Translation models

As a family of methods concentrating on distributed representation learning for KGs, translation models are both straightforward and satisfactorily effective for KGC; they encode entities as low-dimensional embeddings and relations between entities as translation vectors. This kind of model usually defines a relation-dependent translation scoring function that measures the plausibility of a triple through a distance metric. In the ordinary sense, the distance score reflects the correctness of a triple (h, r, t), and it is commonly combined with a margin-based ranking loss for learning the translation relation between entities. We also list a brief table of the basic characteristics of the introduced translation models in Table 9.
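As a concrete illustration of the translation idea and the margin-based ranking loss mentioned above, here is a minimal NumPy sketch (an illustrative assumption, not any model's reference implementation) that scores a triple by the distance ‖v_h + v_r − v_t‖ and computes the margin loss against a corrupted triple.

import numpy as np

def translation_score(v_h, v_r, v_t, norm=1):
    # Distance-based plausibility: smaller means the triple is more likely true.
    return np.linalg.norm(v_h + v_r - v_t, ord=norm)

def margin_ranking_loss(pos, neg, margin=1.0):
    # pos / neg are (v_h, v_r, v_t) tuples for a true triple and a corrupted one.
    return max(0.0, margin + translation_score(*pos) - translation_score(*neg))

rng = np.random.default_rng(0)
d = 4                                            # toy embedding dimension
h, r, t, t_corrupt = (rng.normal(size=d) for _ in range(4))
print(margin_ranking_loss((h, r, t), (h, r, t_corrupt)))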
Table 9
Summarization and comparison of translation models for KGC. Columns: Model | Highlights | Score Function | Notation | Loss Objective^a | Datasets^b.
TransE Extensions:
TransE [11] | Precursory translation method | s(h, r, t) = ‖v_h + v_r − v_t‖ | v_h, v_r, v_t ∈ R^d | L_marg | LP: WN, FB15K, FB1M
TransH [15] | Performs translation in a relation-specific hyperplane | s(h, r, t) = ‖v_h⊥ + v_r − v_t⊥‖, v_h⊥ = v_h − w_r^T v_h w_r, v_t⊥ = v_t − w_r^T v_t w_r | v_h, v_r, v_t ∈ R^d; w_r ∈ R^d | L_marg | LP: WN18, FB15k; TC: WN11, FB13, FB15K
TransR [12] | Converts entity space to relation space via a relational space projection | s(h, r, t) = ‖M_r v_h + v_r − M_r v_t‖ | v_h, v_t ∈ R^d, v_r ∈ R^k; M_r ∈ R^{k×d} | L_marg | LP: WN18, FB15k; TC: WN11, FB13, FB15K
TransD [97] | Different relational mapping matrices for head and tail entities; vector multiplication | s(h, r, t) = ‖M_rh v_h + v_r − M_rt v_t‖, M_rh = v_rp v_hp^T + I^{k×d}, M_rt = v_rp v_tp^T + I^{k×d} | v_h, v_t, v_hp, v_tp ∈ R^d; v_r, v_rp ∈ R^k; M_rh, M_rt ∈ R^{k×d} | L_marg | LP: WN18, FB15k; TC: WN11, FB13, FB15k
lppTransD [98] | Role-specific projection | s(h, r, t) = ‖M′_rh v_h + v_r − M′_rt v_t‖, M′_rh = v_rph v_hp^T + I^{k×d}, M′_rt = v_rpt v_tp^T + I^{k×d} | v_h, v_t, v_hp, v_tp ∈ R^d; v_r, v_rph, v_rpt ∈ R^k; M′_rh, M′_rt ∈ R^{k×d} | L_marg | LP: WN18, FB15K; TC: WN11, FB13, FB15K
TransF [99] | Lightweight and robust; explicitly models basis subspaces of the projection matrices | s(h, r, t) = ‖M_rh v_h + v_r − M_rt v_t‖, M_rh = Σ_{i=1}^{f} α_r^{(i)} U^{(i)} + I, M_rt = Σ_{i=1}^{f} β_r^{(i)} V^{(i)} + I | v_h, v_t ∈ R^d, v_r ∈ R^k; U^{(i)}, V^{(i)} ∈ R^{k×d}; M_rh, M_rt ∈ R^{k×d} | L_marg | LP: FB15k, WN18; TC: FB15k-237, WN18RR
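To illustrate the relation-specific projections listed in Table 9, the following NumPy sketch (an assumption-laden example, not the original authors' code) implements the TransH-style hyperplane projection before applying the translation distance.

import numpy as np

def transh_score(v_h, v_r, v_t, w_r):
    # Normalize the relation-specific normal vector and project both entities
    # onto the hyperplane orthogonal to it.
    w_r = w_r / np.linalg.norm(w_r)
    h_perp = v_h - (w_r @ v_h) * w_r
    t_perp = v_t - (w_r @ v_t) * w_r
    # Translation distance measured on the hyperplane (squared L2 norm).
    return np.sum((h_perp + v_r - t_perp) ** 2)

rng = np.random.default_rng(1)
d = 4
v_h, v_r, v_t, w_r = (rng.normal(size=d) for _ in range(4))
print(transh_score(v_h, v_r, v_t, w_r))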
3.2.1. TransE extensions
We introduce several prominent translation KGC models in the TransE [11] family, which are frequently summarized and cited in the literature. We draw a comprehensive figure exhibiting some representative translation models (shown in Fig. 12).

TransE [11], as a pioneering translation KGC model, balances effectiveness and efficiency compared to most traditional methods. TransE projects entities and relations together into a continuous low-dimensional vector space, where the tail entity t in a triple (h, r, t) can be viewed as the result of translating the head entity h by the relation r, that is:
v_h + v_r ≈ v_t
and it defines its score function as:
s(h, r, t) = ‖v_h + v_r − v_t‖_{l1/l2}
However, the over-simplified translation assumption that TransE holds might constrain its performance when modeling complicated relations, leaving TransE able to model only pure 1-1 relations in KGs. To effectively learn
Table 9 (continued). Columns: Model | Highlights | Score Function | Notation | Loss Objective^a | Datasets^b.
Modification to the Loss Objective of Translation-based KGC:
TransRS [106] | Upper limit on the score function for positive triplets; limit-based scoring loss | s(h, r, t) = ‖v_h + v_r − v_t‖, with f_r(h, t) ≤ γ′ | v_h, v_r, v_t ∈ R^d | L_marg + L_limit | LP: WN18, FB15k; TC: WN11, FB13, FB15K
TransESM [107] | TransRS + TransE's score function; soft margin loss | s(h, r, t) = ‖v_h + v_r − v_t‖, with f_r(h, t) ≤ γ_1 for (h, r, t) ∈ T and f_r(h′, t′) ≥ γ_2 − ξ^r_{h,t} for (h′, r′, t′) ∈ T′ | v_h, v_r, v_t ∈ R^d; γ_2 ≥ γ_1 ≥ 0; ξ^r_{h,t} ≥ 0 | soft L_marg | A scholarly KG
Translation Models in Novel Vector Spaces:
TransA [108] | Adaptive metric approach; elliptical surfaces modeling | s(h, r, t) = (|v_h + v_r − v_t|)^T W_r (|v_h + v_r − v_t|), where |x| = (|x_1|, |x_2|, . . . , |x_n|) and x_i = v_hi + v_ri − v_ti | v_h, v_r, v_t ∈ R^d | L_marg | LP: WN18, FB15K; TC: WN11, FB13
TorusE [109] | TransE + torus | s(h, r, t) = min_{(x,y)∈([h]+[r])×[t]} ‖x − y‖ | [h], [r], [t] ∈ T^n, where T is a torus space | L_marg | LP: WN18, FB15K
RotatE [110] | Entire complex space C; self-adversarial negative sampling | s(h, r, t) = ‖v_h ◦ v_r − v_t‖ | v_h, v_r, v_t ∈ C^d; |v_ri| = 1 | L_ns | LP: FB15k, WN18, FB15k-237, WN18RR
a Put simply, L_ns and L_marg are the negative sampling loss and the margin-based ranking loss respectively; LC_marg denotes a confidence-aware margin-based ranking loss [111]; L_limit refers to the limit-based scoring loss in [106]; and L_HRS is the HRS-aware loss function in [112].
b When we describe the datasets, we apply the shorthand: 'LP' means the Link Prediction task, while 'TC' means the Triple Classification task.
*** The v_rc, v′_r and v_rs are respectively the relation cluster embedding, relation-specific embedding and sub-relation embedding in [112].
c Pr() is a projection function.
Fig. 12. TransE and its extension models. These pictures are referred to [12,15,97–99,102,110].
complex relation types and model various KG structures, a series of enhanced translation-based KGC models continuously improve upon TransE.

TransH [15] projects the entities onto the relation-specific hyperplane defined by the normal vector w_r, via v_h⊥ = v_h − w_r^T v_h w_r and v_t⊥ = v_t − w_r^T v_t w_r, and then performs the translation on this hyperplane, so that the score function is defined as follows:
s(h, r, t) = ‖v_h⊥ + v_r − v_t⊥‖_2^2
which can effectively model 1-n, n-1, and even n-n relations.

TransR [12] considers that there are semantic differences between entities and relations, so they should lie in different semantic spaces; moreover, different relations should constitute different semantic spaces. It converts the entity space to the corresponding relation space through a relational projection matrix M_r ∈ R^{d×k}, and the translation performed in the relation space is:
v_h M_r + v_r ≈ v_t M_r
In order to better model the internal complicated correlations within diverse relation types, this work also extends TransR with the idea of piecewise linear regression to form Cluster-based TransR (CTransR), introducing a cluster-specific relation vector r_c and matrix M_r for each cluster of entity pairs. However, although TransR performs well in handling complicated relation patterns, it involves too many additional parameters, which results in poor robustness and scalability issues when learning large KGs.

TransD [97] further advances TransR by assigning different relational mapping matrices M_rh, M_rt ∈ R^{m×n} to the head and tail entity respectively:
M_rh = r_p h_p^T + I^{m×n}
M_rt = r_p t_p^T + I^{m×n}
h⊥ = M_rh h,  t⊥ = M_rt t
The subscript p marks the projection vectors. TransD then scores a triple (h, r, t) by defining the following function:
s(h, r, t) = −‖h⊥ + r − t⊥‖_2^2
Thus each object in the KG is equipped with two vectors. Additionally, TransD replaces matrix multiplication with vector multiplication, which significantly increases the speed of operation.

lppTransD [98] is an extension of TransD which accounts for the different roles of head and tail entities. The authors indicated that logical properties of relations such as transitivity and symmetry cannot be represented by using the same projection matrix for both head and tail entities [99]. To preserve these logical properties, the lpp-series ideas consider a role-specific projection that maps an entity to a distinct vector according to its role in a triple, whether head entity or tail entity. The concrete mapping matrices are designed as:
M′_rh = r_ph h_p^T + I^{m×n}
M′_rt = r_pt t_p^T + I^{m×n}

TransF [99] is similar to lppTransD in that it also computes the projection matrices for head and tail entities separately. The difference between lppTransD and TransF is that TransF mitigates the burden of relation projection by explicitly modeling the basis subspaces of the projection matrices with two separate sets of basis matrices U^(i), V^(i); the two factorized projection matrices are calculated as:
M_{r,h} = Σ_{i=1}^{s} α_r^{(i)} U^{(i)} + I
M_{r,t} = Σ_{i=1}^{s} β_r^{(i)} V^{(i)} + I
Inspired by TransR, TransF is robust and lightweight enough to deal with large-scale KGs by easily learning multiple relations through explicit modeling of the underlying subspaces of the relation-specific projection matrices.

STransE [100] properly combines insights from SE [113] and TransE [11]: it draws on the relation-specific matrices of SE for relation-dependent identification of both head and tail entities, and also follows the basic translation principle of the TransE model.

Trans-FT [101] develops a general principle called Flexible Translation (FT), which enables it to model complex and diverse objects in KGs, unlike previous translation models that only impose a strict restriction of translation among entities/relations (such as TransE). Experiments adapt FT to existing translation models; TransR-FT obtains the best performance compared to the other two baselines (TransE-FT and TransH-FT).

3.2.2. Translation models with attention mechanism
TransM [102] is an appropriate solution to the inflexibility of TransE. It focuses on the diverse contribution (i.e., the various relational mapping properties) of each training triple to the final optimization target; TransM therefore develops a weighted mechanism with which each training triple is assigned a pre-calculated distinct weight according to its relational mapping property. In other words, we can regard this weighting operation as an attention mechanism that treats every training example with its own impact, in order to deal well with the various mapping properties of triples.

ITransF [103] aims to make full use of the shared concepts of relations and apply them to perform knowledge transfer effectively. ITransF is equipped with a sparse attention mechanism to discover sharing regularities by learning interpretable sparse attention vectors, which fully capture the hidden associations between relations and shared concepts.

TransAt [104] effectively learns the translation-based embedding using a reasonable attention mechanism. It exploits a piecewise evaluation function which divides the KGC problem into a two-stage process: first checking whether the categories of head and tail entities make sense with respect to a given relation, and then, for those possible compositions, considering whether the relation holds under the relation-related dimensions (attributes). During this two-stage process, TransAt uses K-means clustering to generate categories for generality. TransAt sets the projection function by computing, for each dimension, the variance among the head (tail) entities associated with relation r in the training set; additionally, it designs a threshold to determine whether a dimension should be retained. In consideration of the ORC structure problems [114], TransAt applies an asymmetric operation to head and tail entities, so the same entity has different representations in the head position and the tail position.

TransGate [105] pays close attention to the inherent relevance between relations. To learn more expressive features and reduce parameters simultaneously, TransGate follows the idea of parameter sharing using a gate structure and then integrates the shared discriminate mechanism into its architecture to ensure
Table 10
Published link prediction results of translation models. Best results are in bold.
Model WN18 FB15K WN18RR FB15K-237
MR MRR Hits@10 MR MRR Hits@10 MR MRR Hits@10 MR MRR Hits@10
TransE [11] 251 – 0.892 125 – 0.471 2300a 0.243a 0.532a 323a 0.279a 0.441a
TransH [15] 303 – 0.867 87 – 0.644 – – – – – –
TransR [12] 225 – 0.920 77 – 0.687 – – – – – –
TransD [97] 212 – 0.922 91 – 0.773 – – – – – –
lppTransD [98] 270 – 0.943 78 – 0.787 – – – – – –
TransF [99] 198 0.856 0.953 62 0.564 0.823 3246 0.505 0.498 210 0.286 0.472
STransE [100] 206 0.657 0.934 69 0.543 0.797 – – – – – –
Trans-FT [101] 342 – 0.953 49 – 0.735 – – – – – –
TransM [102] 281 – 0.854 94 – 0.551 – – – – – –
ITransF [103] 223 – 0.952 77 – 0.814 – – – – – –
TransAt [104] 157 – 0.950 82 – 0.782 – – – – – –
TransRS [106] 357 – 0.945 77 – 0.750 – – – – – –
TransA [108] 392 – 0.943 74 – 0.804 – – – – – –
TorusE [109] – 0.947 0.954 – 0.733 0.832 – – – – – –
RotatE [110] 309 0.949 0.959 40 0.797 0.884 3340 0.476 0.571 177 0.388 0.533
TransGate [105] – – – 33 0.832 0.914 3420 0.409 0.510 177 0.404 0.581
a Resulting numbers are reported by [91], and others are taken from the original papers.
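RotatE, which appears in Table 10 and is analyzed below, models each relation as an element-wise rotation in the complex plane. The following NumPy sketch is a simplified illustration of that scoring idea (unit-modulus relation phases are assumed); it is not the reference implementation.

import numpy as np

def rotate_score(v_h, theta_r, v_t):
    # v_h, v_t: complex entity embeddings; theta_r: rotation phases, so that
    # the relation embedding e^{i*theta_r} has modulus 1 in every dimension.
    v_r = np.exp(1j * theta_r)
    return np.linalg.norm(v_h * v_r - v_t, ord=1)   # smaller = more plausible

rng = np.random.default_rng(2)
d = 4
v_h = rng.normal(size=d) + 1j * rng.normal(size=d)
v_t = rng.normal(size=d) + 1j * rng.normal(size=d)
theta_r = rng.uniform(0, 2 * np.pi, size=d)
print(rotate_score(v_h, theta_r, v_t))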
Trans-FT, we can conclude that: (1) Based on the translation idea of TransE, for a triple (h, r, t) it is necessary to further consider the semantic differences between entities and relations. (2) TransF achieves a clear and substantial improvement over the others in this series. The reason is that TransF factorizes the relation space as a combination of multiple sub-spaces for representing different types of relations in KGs. Besides, TransF is more robust and efficient than congeneric methods because it models the underlying subspaces of the relation-specific projection matrices for explicitly learning various relations.
2. The attention-based methods TransM, ITransF, and TransAt almost consistently outperform TransE. Specifically, ITransF performs better on most of the metrics of WN18 and FB15k, while TransM has a poor result on the sparser WN18 dataset. The reason is that ITransF employs a sparse attention mechanism to encourage concept sharing across different relations, which primarily benefits facts associated with rare relations. TransAt focuses on the hierarchical structure among the attributes of an entity, so it utilizes a two-stage discriminative method to realize an attention mechanism. This suggests that a proper attention mechanism can help to fit the human cognition of a hierarchical routine effectively.
3. Both TorusE and RotatE obtain good performance on WN18 and FB15k. RotatE is good at modeling and inferring three types of relation patterns: the symmetry pattern, the composition pattern, and the inversion pattern, by defining each relation as a rotation in complex vector spaces. By comparison, TorusE focuses on the problem of regularization in TransE. Although TorusE can be regarded as a special case of RotatE, since it defines KG embeddings as translations on a compact Lie group, the moduli of embeddings in TorusE are fixed, while RotatE is defined on the entire complex space, which is critical for modeling and inferring composition patterns. Therefore, RotatE has much more representation capacity than TorusE, which may help explain why RotatE performs better than TorusE on WN18 and FB15k.
4. TransGate achieves excellent performance on the four datasets, especially in the metrics on FB15k and FB15k-237. These results show the appropriateness of sharing discriminate parameters and the great ability of the gate structure. Actually, TransGate achieves a better trade-off between complexity and expressivity by following the parameter sharing strategy. With the help of the shared discriminate mechanism based on the gate structure, TransGate can optimize embeddings and reduce parameters simultaneously. However, TransGate performs more poorly on WN18RR, since WN18RR removes reverse relations and destroys the inherent structure of WordNet, which results in low relevance between relations and further reduces the effect of parameter sharing [105].

3.2.6. Discussion on translation models
In summary, the translation models based on internal structure information are simple but surprisingly effective when solving KGC problems. Additionally, translation models only need a few parameters. At present, translation models usually serve as the basis for extended models that exploit a wider variety of additional information sources, which benefits from the easy-to-use translation transformation hypothesis. Ordinarily, combining translational characteristics with additional information to conduct KGC is an ongoing trend. This group of methods takes account of other useful information instead of only utilizing the inner structure information, building on the classic translation-distance baselines or following the basic translation assumption. For instance, OTE [115] advances RotatE in two ways: (1) leveraging orthogonal transforms [116] to extend RotatE from the 2D complex domain to a high-dimensional space for improving modeling ability, and (2) making use of the context information of nodes. PTransE (path-based TransE) [12] and PTransD [117] are both path-augmented translation-based models, while TransN [31] considers the dependencies between triples and incorporates neighbor information dynamically. On the other hand, researchers have begun to explore how to implement the basic translation transformation of entities and relations in a more effective and reasonable modeling space, so as to easily model complex types of entities and relations and various structural information. In this context, the improvement and optimization of the loss function is also a promising research direction.

4. Additional information-based KGC technologies

The research on additional information-based KGC has received increasing attention in recent years. The techniques surveyed in Section 3 perform KGC mainly relying on the structure information of KGs (i.e., the simple triple structure information); of course, several methods mentioned in Section 3 also simultaneously utilize additional information for KGC. For example, KBAT [91] considers the multi-hop neighborhood information of a given entity to capture entity and relation features, and DrWT [45] leverages the additional Wikipedia page documents of entities outside KGs. In this section, we focus on the additional
Table 11
Characteristics of KGC methods using nodes’ attributes information.
Model Highlights Nodes information Jointly learning expression Datasets
KGC using numeric attribute information
KBLRN [118] | End-to-end jointly training model; multi-task learning; feature types-combining approach | Numerical attributes | L = − Σ_{(h,r,t)∈T} log p((h, r, t) | θ_1, . . . , θ_n)^a | FB15k-num, FB15k-237-num
MTKGNN End-to-end multi-task NN Numeric attributes Lattr = Lhead + Ltail YG24K, FB28K
[119] Lhead = MSE(gh (ai ), (ai )∗ )
Ltail = MSE(gt (aj ), (aj )∗ )
TransEA [120] TransE + numerical attributes Numeric attributes L = (1 − α ) · LTransE + α · LA YG58K, FB15K
LTransE : TransE loss; LA : attribute loss
KGC using textual attribute information
JointAs [15] Jointly neural network model Node’s name, anchors L = LK + LT + LA Freebase
LK : KGC loss;
LT : Text model loss;
LA : Alignment loss
JointTs [121] Replaces anchors in JointAs with text Node’s name L = LK + LT + LAT Freebase
description LAT : text description-aware Alignment loss
KGC using image attribute information
IKRL [122] Neural image encoder; Image attributes s(h, r , t) = sSS + sSI + sIS + sII WN9-IMG
translation-based decoder; sXY = ∥hX + r − tY ∥,
attention mechanism S(I): structure(image)-based representations
KGC using multi-modal attribute information
VALR [123] Linguistic embeddings; Text attributes, image s(h, r , t) = sS + sM1 + sM2 + sSM + sMS FB-IMG, WN9-IMG
neural network architecture; attributes sS = ∥hS + rS − tS ∥,
multi-model additional energy function sM1 = ∥hM + rS − tM ∥,
sM2 = ∥(hM + hS ) + rS − (tM + tS )∥,
sSM = ∥hS + rS − tM ∥,
sMS = ∥hM + rS − tS ∥
S /M: structure/multi-modal representations
MKBE [124] | Feature type specific encoders/decoders; DistMult/ConvE; multi-modal KGs modeling; VGG network pretrained on ImageNet | Text attributes, image attributes, numeric attributes | L = Σ_{(h,r)} Σ_t [ l_t^{h,r} log(p_t^{h,r}) + (1 − l_t^{h,r}) log(1 − p_t^{h,r}) ], p_t^{h,r} = σ(s(h, r, t)), l_t^{h,r}: a binary label | YAGO-10
MMKG [125] | Relational reasoning across different entities and images | Numeric attributes, image attributes | L = − Σ_{(h,r,t)∈T} log p((h, r, t) | θ_1, . . . , θ_n)^a | DB15K, YAGO15K, FB15K
LiteralE [126] End-to-end universal extension module Numeric attributes, text sX (h, r , t) → sX (g(h, lh ), r , g(t , lt )) FB15k, FB15k-237,
attributes g(): a gated function; X : specific KGE models YAGO-10
a θ_i: the parameters of the individual model.
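Several entries in Table 11 (e.g., TransEA's L = (1 − α) · L_TransE + α · L_A) combine a structural triple loss with an attribute regression loss. The sketch below shows this weighted joint objective in a generic form; the attribute predictor, the weighting α, and the loss choices are illustrative assumptions rather than the exact formulation of any single listed model.

import numpy as np

def structural_loss(v_h, v_r, v_t):
    # TransE-style distance used as a stand-in structural loss term.
    return np.linalg.norm(v_h + v_r - v_t, ord=1)

def attribute_loss(v_e, w_a, b_a, attr_value):
    # Squared error of a linear regressor that predicts a numeric attribute
    # (e.g., a birth year) from the entity embedding.
    pred = float(v_e @ w_a + b_a)
    return (pred - attr_value) ** 2

def joint_loss(triple, attr_example, alpha=0.3):
    # Weighted combination of the structural and attribute objectives.
    return (1 - alpha) * structural_loss(*triple) + alpha * attribute_loss(*attr_example)

rng = np.random.default_rng(3)
d = 4
h, r, t, w_a = (rng.normal(size=d) for _ in range(4))
print(joint_loss((h, r, t), (h, w_a, 0.0, 1.5)))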
Table 13
Summarization of introduced KGC methods using Entity-related information.
Model Technology Entity information Dataset
KGC using entity types information:
TRESCAL [127] a. base on RESCAL; Entity type information; textual data NELL
b. low computational complexity;
c. entity-type constraints
TCRL [128] a. entity-type constraint model; Entity type information Dbpedia-Music,
b. under closed-world assumption FB-150k,YAGOc-195k
TransT [129] a. dynamical multiple semantic vectors; Structured information; entity type information FB15k, WN18
b. entities-relations similarity as prior knowledge
FRNs [130] a. jointly modeling KGs and aligned text; Entity type information; additional textual FB15k-237
b. a composition and scoring function parameterized by evidence
a MLP
OBDL [131] a. deep learning framework (NTN); Entity type hierarchy feature; ontological WordNet, Freebase
b. a new initialization method for KGE; information
c. unseen entity prediction
KGC using entity hierarchy taxonomic information:
EHE [132] a. distance matrix; Entity hierarchy information Wikipedia snapshot
b. entity similarity measuring
SSE [133] Portable smoothness assumption: Entity semantic categories NELL_L, NELL_S, NELL_N
a. Laplacian Eigenmaps 186
b. Locally Linear Embedding.
NetHiex [134] a. a nonparametric probabilistic framework Hierarchical taxonomy information BlogCatalog, PPI, Cora,
b. nested Chinese restaurant process Citeseer
c. EM algorithm
GTF [135] a. knowledge guided tensor factorization method; Entity taxonomy hierarchy; corresponding Animals, Science
b. guided quantification constraints; relation schema
c. imposing schema consistency
SimplE+ [136] SimplE with non-negativity constraints Subclass and subproperty taxonomic WN19, FB15K, Sport,
information of entity Location
Table 14
Characteristics of introduced KGC methods using relation-related information.
Model Technologies Relation-related information Datasetsa
TranSparse [139] Complex relation-related transformation matrix Heterogeneous and imbalance characteristics LP: FB15k, WN18,
of relations FB15k-237, WN18RR
AEM [140] Relation weight Asymmetrical and imbalance characteristics of LP: WN18, FB15K;
relations TC: WN11, FB13, FB15K
Trans-HRS [112] TransE/TransH/DistMult + HRS structure Three-layer HRS structure information of LP: FB15K, WN18
relations
On2Vec [141] a. Component-specific Model encoder Hierarchical relations RP: DB3.6K,CN30K,
b. Hierarchy Model YG15K,YG60K
JOINTAe [142] a. autoencoder Compositional information of relations LP: WN18, FB15k,
b. considers relation inverse characteristic WN18RR, FB15k-237
c. based on RESCAL
d. relations composition in [143]
Riemannian- a. multi-relational graph embedding Multi-relational (hypernym and synonym) TP: WN11, FB13
TransE b. Non-Euclidean Space modeling information of relations
[144] c. based on TransE
d. non-Euclidean manifold
TRE [145] Relation inference based on the triangle pattern of Entity-independent transitive relation patterns LP: FB15K, WN18,
knowledge base RP: FB15K, WN18, DBP
a 'LP', 'RP' and 'TC' respectively refer to the Link Prediction task, Relation Prediction task and Triple Classification task.
Table 15
Characteristics of introduced KGC methods using neighbor information.
Model Technology Additional information Datasets
Aggregating neighbors with attention mechanism:
A2N [150] DistMult + attention scoring Neighbor structure information FB15K-237, WN18RR
LENA [149] Windowed Attentions; Neighbor structure information FB15K, FB15K-237, WN18, WN18RR
Cross-Window Pooling
LAN [151] Logic Attention Network; Relation-level information; Subject-10 and Object-10 in FB15K
end-to-end model: neighbor-level information
Encoder: LAN, Decoder: TransE
G2SKGEatt [152] Graph2Seq network; Neighbor structure information FB15K, FB15K-237, WN18, WN18RR
attention mechanism;
end-to-end model:
Encoder: Graph2Seq, Decoder: ConvE
KBAT [91] Generalized GAT; Entity’s multi-hop neighborhood FB15K-237, WN18RR, NELL-995, Kinship
end-to-end model:
Encoder: KBAT, Decoder: ConvKB
RGHAT [153] GNN; Entity’s multi-hop neighborhood FB15K, WN18, FB15K-237, WN18RR
hierarchical attention mechanism;
end-to-end model:
Encoder: RGHAT, Decoder: ConvE
Other technologies for KGC using neighbor information:
GMatching [154] Permutation-invariant network; Neighbor structure information NELL-One, Wiki-One
LSTM;
end-to-end model:
Encoder: neighbor encoder,
Decoder: matching processor
GMUC [155] Gaussian metric learning; Neighbor structure information NL27K-N0, NL27K-N1, NL27K-N2 and
few-shot UKGC; NL27K-N3
end-to-end model:
Encoder: Gaussian neighbor encoder,
Decoder: LSTM-based matching networks
NKGE [31] Dynamic Memory Network; Structure representation; FB15K, FB15K-237, WN18, WN18RR
gating mechanism; neighbor representation
end-to-end model:
Encoder: DMN, Decoder: TransE/ConvE
CACL [93] Contextual information collection; Multi-hop neighborhoods structure information FB13, FB15K, FB15K-237, WN18RR
context-aware convolutional
OTE [115] RotatE; Graph contexts representations FB15K-237, WN18RR
orthogonal transforms
CNNIM [156] Concepts of Nearest Neighbors; Neighbors information FB15k-237, JF17k, Mondial
Dempster–Shafer theory
CAFE [157] Neighborhood-aware feature set; Neighborhood-aware features FB13-A-10, WN11-AR-10, WN18-AR-10,
feature grouping technique NELL-AR-10
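Many of the encoders summarized in Table 15 aggregate an entity's neighborhood with attention weights before a decoder scores the triples. The following NumPy sketch is a generic, simplified form of such attention-weighted neighbor aggregation (a softmax over dot-product scores); the scoring function and dimensions are assumptions, not the exact mechanism of any listed model.

import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def aggregate_neighbors(query, neighbor_embs):
    # query: embedding of the target entity (or of the query relation).
    # neighbor_embs: one embedding per (relation, neighbor entity) pair.
    scores = neighbor_embs @ query            # unnormalized attention scores
    alpha = softmax(scores)                   # attention weights sum to 1
    return alpha @ neighbor_embs              # weighted sum = aggregated context

rng = np.random.default_rng(4)
d, n_neighbors = 4, 3
query = rng.normal(size=d)
neighbors = rng.normal(size=(n_neighbors, d))
print(aggregate_neighbors(query, neighbors))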
representation. For the attention scoring, A2N uses the DistMult function to project the neighbors into the same space as the target entities.
Inspired by the idea of aggregating neighbors with an attention mechanism in [150], many closely related studies have emerged:
The locality-expanded neural embedding with attention (LENA) [149] is introduced to filter out irrelevant messages among neighborhoods with the support of an attentional setting. This work indicates that KG embedding relying even on sufficient structure information is deficient, since graph data tend to be heterogeneous. Therefore, LENA emphasizes that the information involved in the graph neighborhood of an entity plays a great role in KG embedding, especially for complex heterogeneous graphs.
Logic Attention Network (LAN) [151] is a novel KG-specific neighborhood aggregator that uses an attention mechanism to aggregate neighbors in a weighted combination manner. This work designs two mechanisms for modeling relation-level and neighbor-level information respectively, from coarse to fine: a Logic Rule Mechanism and a Neural Network Mechanism; in the end, a double-view attention is employed to incorporate these two weighting mechanisms together in measuring the importance of neighbors. LAN meets all three significant properties: Permutation Invariance, Redundancy Awareness and Query Relation Awareness.
G2SKGEatt [152] develops an information fusion mechanism, Graph2Seq, to learn embeddings that fuse the sub-graph structure information of entities in the KG. To make fusion more meaningful, G2SKGEatt formulates an attention mechanism for fusion. The 1-N scoring strategy proposed by ConvE [33] is used to speed up the training and evaluation process.
KBAT [91] is also an attention-based KGE model which captures both entity and relation features in the multi-hop neighborhood of a given entity. KBAT uses ConvKB [68] as its decoder module and specifically caters to the relation prediction (RP) task. RGHAT [153] designs a novel hierarchical attention mechanism to compute different weights for different neighboring relations and entities. Considering that the importance of different relations differs greatly in characterizing an entity, and to highlight the importance of different neighboring entities under the same relation, the hierarchical attention mechanism includes two levels of attention: a relation-level attention and an entity-level attention. The relation-level attention first characterizes an entity by computing the weights of its different neighboring relations, and then the entity-level attention computes the attention scores for different neighboring entities under each relation. Finally, each entity aggregates information and gets updated from its neighborhood based on the hierarchical attentions. RGHAT can utilize the neighborhood information of an entity more effectively with the use of the hierarchical attention mechanism.

4.1.4.2. Other technologies for KGC using neighborhood information. Some other works adopt different technologies to make use of the neighborhood information.

GMatching [154] takes into consideration those one-shot relations which usually contain valuable information and make up a large proportion of KGs, and introduces an intelligent solution to the problem of KG sparsity caused by long-tail relations. GMatching learns knowledge from one-shot relations to solve the sparsity issue and further avoids retraining the embedding models when new relations are added to existing KGs. This model consists of two components: a neighbor encoder and a matching processor, which are responsible for encoding the local graph structure to represent entities and for calculating the similarity of two entity pairs, respectively.

GMUC [155] is a Gaussian metric learning-based method that aims to complete few-shot uncertain knowledge graphs (UKGs, such as NELL and Probase, which model uncertainty as confidence scores attached to facts). As the first work to study the few-shot uncertain knowledge graph completion (UKGC) problem, GMUC uses a Gaussian neighbor encoder to learn Gaussian-based representations of relations and entities. Then a Gaussian matching function conducted by LSTM-based matching networks is applied to calculate the similarity metric. The matching similarity can be further used to predict missing facts and their confidence scores. GMUC can effectively capture uncertain semantic information by employing the Gaussian-based encoder and the metric matching function.

NKGE [31] uses an End-to-End Memory Network (MemN2N)-based Dynamic Memory Network (DMN) encoder [158] to extract information from entity neighbors, and a gating mechanism is utilized to integrate the structure representations and neighbor representations. Based on TransE [11] and ConvE [33] respectively, NKGE designs two kinds of architectures to combine structure representation and neighbor representation. Experimental results show that the TransE-based model outperforms many existing translation methods, and the ConvE-based model achieves state-of-the-art metrics on most experimental datasets.

Context-aware convolutional learning (CACL) [93] explores the connection modes between entities using their neighbor context information, which facilitates the learning of entity and relation embeddings via convolutional deep learning techniques that directly use the connection modes contained in each multi-hop neighborhood.

Orthogonal transform embedding (OTE) [115] advances RotatE [110] in two ways: (1) it leverages orthogonal transforms [116] to extend RotatE from the 2D complex domain to a high-dimensional space in order to raise modeling ability, and (2) OTE takes account of the neighbor context information and effectively learns entity embeddings by fusing relative graph context representations. Experiments comparing OTE with RotatE, R-GCN and A2N reveal the great availability of OTE.

Concepts of Nearest Neighbors-based Inference Model (CNNIM) [156] performs LP by recognizing similar entities among common graph patterns through the use of Concepts of Nearest Neighbors [159], from which Dempster–Shafer theory [160] is adapted to draw inferences. CNNIM only spends time in the inference step, because it abolishes time-consuming training to keep a form of instance-based learning. The application of graph patterns instead of numerical distances makes the proposed method interpretable.

CAFE [157] completes KGs using sets of neighborhood-aware features to evaluate whether a candidate triple should be added to the KG. The proposed set of features helps to transform triples in the KG into feature vectors, which are further labeled and grouped for training neural prediction models for each relation. These models help to discern between correct triples that should be added to the KG and incorrect ones that should be disregarded. Note that since CAFE exploits the highly connected nature of KGs rather than requiring pre-processing of the KG, it is especially suitable for ever-growing KGs and dense KGs.

4.1.4.3. Discussion on KGC models using neighborhood information. From the above introduction and comparison of neighbor-based KGC literature, we further make a basic discussion and analysis as follows:
(1) To better obtain the neighborhood graph information, we need to select an appropriate fusion strategy to collect useful surrounding neighbor contexts.
(2) We find that most models tend to use the encoder-to-decoder (end-to-end) architecture when learning neighbor information for KGC; in other words, the neighbor learning part is portable and can be applied to various KGE models such as translation models (e.g., TransE [11], TransH [15], TransR [12]) and bilinear models [42,161]. We give a presentation of these end-to-end structures in Table 15, and show them in Fig. 19 to illustrate this intuition.
(3) The embedding parameters for every entity-relation pair may be prohibitively large when the learned neighbor fusion is fixed, which has led adaptable mixture methods that condition on the query to become more and more popular in recent years.

4.1.5. Relational path information
In KGs, there are substantial multi-step relation paths between entities indicating their semantic relations; these relation paths reflect complicated inference patterns among relations in KGs [12]. This has promoted the rise of path-based relation inference, one of the most important approaches to the KGC task [168]. We list these path-based KGC works in Table 16.
Multi-hop KGC (mh-KGC): We refer to the definition in [163]; mh-KGC aims at performing KGC based on existing relation paths. For example, in Fig. 20, for the relation path Microsoft → IsBasedIn → Seattle → IsLocatedIn → Washington → IsLocatedIn → United States (as the blue lines show), the task is to predict whether (or what) direct relation connects h and t; i.e., (Microsoft, CountryOfHQ, United States) in this case. This kind of reasoning lets us infer new or missing facts from KGs. Sometimes there can exist multiple long paths between two entities, and in this scenario the target relation may be inferable from more than one path.

4.1.5.1. Multi-hop KGC using atomic-path features. The Path Ranking Algorithm (PRA) [164] is the first work that emerged as a promising method for learning inference paths in large KGs. It uses random walks to generate relation paths between given entity pairs through depth-first search processes. The obtained paths are then encoded as relational features and combined with a logistic regression model to learn a binary log-linear classifier that decides whether the given query relation exists between the entity pairs.
However, millions of distinct paths are generated in a single classifier by the PRA method; it may suffer from the feature explosion problem because each path is treated as an atomic feature, which makes the atomic-path idea difficult to adopt for KGs with an increasing number of relation types [166]. Additionally, since
Table 16
Characteristics of introduced KGC methods using relational path information.
Model Technology Additional information Path selection strategy Datasets
Mh-KGC using atomic-path features:
PRA [164] Random walks Atomic path feature A single path NELL
SFE [165] Breadth-first search Atomic path feature A single path NELL
Non-atomic multi-hop reasoning:
PATH-RNN RNN + PRA, zero-shot reasoning Non-atomic and compositional path feature, Max pooling Freebase +
[166] arbitrary-length path ClueWeb
Trans-COMP Compositional training, path Non-atomic path feature A single path WordNet, Freebase
[143] compositional regularizer
Path-augmented translation models:
PTransE [12] PCRAa + TransE + path scoring Non-atomic path feature PCRA FB15K
RTransE [39] TransE + regularization Non-atomic path feature Focused on ‘‘unambiguous’’ paths: FB15K, FAMILY
composition ℓ1 : 1-to-1 or 1-to-many relations,
ℓ2 : 1-to-1 or many-to-1 relations
PTransD [117] Path-augmented TransD Path PCRA FB15K
Modeling paths using neural networks:
Single-Model Path-RNN; Shared Parameter Path, intermediate nodes, entity-types Scoring pooling: Top-K, Average and Freebase +
[167] Architecture LogSumExp ClueWeb
APCM [168] RNN + Attention Path, entity type Attentive Path Combination FC17
IRNs [169] Shared memory + controller Path, structured relation information Controller determines the length of paths WN18, FB15K
ROHP [163] Three ROPs architectures: GRUs Path Arbitrary-length path Freebase +
ClueWeb
PRCTA [170] RNN; constrained type attention; Path, entity and relation types Path-level attention Freebase +
relation-specific type constraints ClueWeb
mh-RGAN [96] RNN reasoning models + GAN Non-atomic path feature Generator G of GAN WordNet, FreeBase
Combine path information with type information:
All-Paths [171] Dynamic programming, considers Path, relation types Dynamic programming NCI-PID and
intermediate nodes WordNet.
RPE [172] Relation-specific type constraints; Path, relation type Reliable relation paths-selection strategy LP: FB15K; TC:
path-specific type constraints FB15K, FB13,
WN11
APM [173] Abstract graph + path Abstract paths, strongly typed relations Paths in abstract graph Freebase, NELL
Leveraging order information in paths:
OPTransE [174] TransE + Ordered Relation Paths Path, relation orders Path fusion: two layer pooling strategy WN18 and FB15K
PRANN [175] CNN + BiLSTM Path + entities/relations orders, entity Path-level Attention NELL995,
types FB15k-237,
Countries, Kinship
a 'PCRA': path-constraint resource allocation algorithm [176].
the graph at all — but the core mechanism of these three works continues to be a classifier based on atomic-path features. Besides, neither one can perform zero-shot learning, because there must be a classifier for each predicted relation type in their approaches.

4.1.5.2. Non-atomic multi-hop reasoning. Some works explore utilizing path information as non-atomic features during the KGC procedure.

PATH-RNN [166] can not only jointly reason over a path, but also reason over the elements of paths in the embedded vector space in a non-atomic and compositional manner. Using recursive neural networks (RNNs) [179] to recursively apply a composition function that describes the semantics of latent relations over arbitrary-length paths (see Fig. 22(a)), PATH-RNN finally produces a corresponding path vector after traversing a path. PATH-RNN can infer from paths not seen during training, and can also deduce relations that do not exist in the KG.

TransE-COMP [143] suggests a new compositional training objective that dramatically improves the path modeling ability of various traditional KGC models for answering path queries. This technique is applicable to a broad class of combinable models that include the bilinear model [13] and TransE [11], i.e., the score function:
s(s/r, t) = M(T_r(x_s), x_t)
represents a combinatorial form, where the traversal operator T_r(x_s) answers a path query (x_s, r, ?) following T_r: R^d → R^d, and the operator M denotes the combinable model's scoring operation M: R^d × R^d → R. For example, when combined with TransE, the traversal operator becomes T_r(x_s) = x_s + w_r and the score function then turns into:
s(s/r, t) = M(T_r(x_s), x_t) = −‖T_r(x_s) − x_t‖_2^2
so that it can handle a path query q = s/r_1/r_2/.../r_k by:
s(q, t) = −‖x_s + w_{r_1} + · · · + w_{r_k} − x_t‖_2^2
The compositional training is regarded as providing a new form of structural regularization for existing models, since it substantially reduces the cascading errors present in the base vector space model.
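As a concrete illustration of the compositional score above, the following NumPy sketch (an illustrative assumption, not code from [143]) answers a path query under the TransE-style traversal operator by adding the relation vectors along the path.

import numpy as np

def path_query_score(x_s, path_relations, x_t):
    # TransE-style traversal: compose the path by summing its relation vectors,
    # then score the candidate tail with negative squared L2 distance.
    traversal = x_s + np.sum(path_relations, axis=0)
    return -np.sum((traversal - x_t) ** 2)

rng = np.random.default_rng(5)
d = 4
x_s, x_t = rng.normal(size=d), rng.normal(size=d)
w_r1, w_r2 = rng.normal(size=d), rng.normal(size=d)   # relations on the path q = s/r1/r2
print(path_query_score(x_s, np.stack([w_r1, w_r2]), x_t))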
4.1.5.3. Path-augmented translation models. The path-augmented translation methods, which introduce multi-step path information into classical translation models, have been developed.

PTransE [12] uses path information in its energy function, in which the latter item E(h, P, t) models the inference correlations between relations with multi-step relation path triples. In PTransE, relation paths p ∈ P(h, t) are represented via semantic composition of the relation embeddings, by performing an Addition, Multiplication or RNN operation:
Addition: p = r_1 + ... + r_l
Multiplication: p = r_1 · ... · r_l
RNN: c_1 = r_1, . . . , p = c_n
Simply put, PTransE doubles the number of edges in the KG by creating a reverse relation for each existing relation. Then PTransE uses a path-constraint resource allocation algorithm (PCRA) [176] to select reliable input paths within a given length constraint.

RTransE [39] learns compositions of relations as sequences of translations in TransE by simply reasoning over paths; in this process, RTransE only considers a restricted set of paths of length two. The paper augments the training set with relevant examples of the above-mentioned compositions and trains so that sequences of translations lead to the desired result.

PTransD [117] is a path-augmented TransD; it treats relation paths as translations between entities for KGC. Similar to TransD, PTransD places entities and relations in different semantic spaces. PTransD uses two vectors to represent each entity and relation, where one represents the meaning of the entity (relation), and the other is used to construct the dynamic mapping matrix.

4.1.5.4. Modeling paths using neural networks. We can see that neural networks are handy in modeling paths, especially along the RNN line of work:

Single-Model [167] Based on PATH-RNN [166], Single-Model discusses path-based complex reasoning methods extended by RNNs, jointly reasoning with within-path relations, entities, and entity types in the paths.

Attentive Path Combination Model (APCM) [168] first generates path representations using an RNN architecture; it then assigns discriminative weights to the path representations to form the representation of an entity pair; finally, a dot-product operation between the entity-pair representation and the query relation representation is designed to compute the score of a candidate query relation, so that an entity pair obtains a representation with respect to query relations in a dynamic manner.

Implicitly ReasonNets (IRNs) [169] designs a network architecture that performs multi-hop reasoning in vector space based on a shared memory. The key highlight is the employment of a shared memory that intelligently saves relevant large-scale structured relation information in an implicit manner, so that explicit human-designed inference can be avoided. IRNs reasons according to a controller that stipulates the inference step during the whole inference procedure while interacting properly with the shared memory. This work performs excellently on KGC with complex relations.

Recurrent one-hop predictor Model (ROHP) [163] explores three ROHP architectures with the capability of modeling KG paths of arbitrary lengths by using recurrent neural networks (GRUs [180]) to predict entities in the path step by step for multi-hop KG reasoning.

Path-based Reasoning with Constrained Type Attention (PRCTA) is equipped with a constrained type attention mechanism for multi-hop path reasoning [170]. On the one hand, PRCTA encodes type words of both entities and relations to extract abundant semantic information, which partly alleviates the sparsity issue; on the other hand, to reduce the impact of noisy entity types, the constrained type attention is designed to softly select contributing entity types among all the types of a certain entity in various scenarios; meanwhile, relation-specific type constraints are fully exploited to enhance entity encoding. Final path encoding leverages path-level attention to combine useful paths and produce path representations.

We collect some representative structures of methods that model path information for KGC using RNNs in Fig. 22. There are other path-based KGC models using other neural network frameworks:

Multi-hop Relation GAN (mh-RGAN) [96] considers multi-hop (mh) reasoning over KGs with a generative adversarial network (GAN) instead of training RNN reasoning models. The mh-RGAN consists of two antagonistic components: a generator G responsible for composing a mh-RP, and a discriminator D tasked with distinguishing real paths from fake paths.

4.1.5.5. Combining path information with type information. Some methods consider type information of entities and relations when modeling path representations, such as [167,168], and [170].

Relational Path Embedding model (RPE) Lin et al. [172] extend the relation-specific type constraint to a new path-specific type constraint; both type constraints can be seamlessly incorporated into RPE to improve the prediction quality. In addition, RPE takes full advantage of the semantics of the relation path to explicitly model KGs. Using the composite path projection, RPE can embed each entity into the proposed path space to better handle relations with multiple mapping characteristics.

Abstract Path Model (APM) [173] focuses on the generation of an abstract graph depending on the strongly typed relations and then develops a traversal algorithm for mining abstract paths in the produced intensional graph. Those abstract paths tend to contain more potential patterns for executing KG tasks such as LP. The proposed abstract graph drastically reduces the original graph size, making it more tractable to process various graphs.

Fig. 23. Example of the meaning change when the order of relations is altered.

4.1.5.6. Leveraging order information in paths. The order of relations and entities in paths is also important for reasoning. As Fig. 23 shows, the meaning will change when the order of relations is altered [174].

OPTransE [174] attaches importance to the order information of relations in relation paths by projecting each relation's head entity and tail entity into different vector spaces respectively. To capture the complex and nonlinear features hidden in the paths, OPTransE designs a multi-flow of min-pooling layers. It was experimentally validated that OPTransE performs well in the LP task, directly mirroring the vital role of relation order information in relation paths for KGC.

Path-based Reasoning with Attention-aware Neural Network (PRANN) [175] also uses the ordering of the local features to learn
Table 17
Statistics of path-based KGC datasets FC (Freebase + ClueWeb) and FC17.
Datasets FC FC17
Entities 18M 3.31M
Freebase triples 40M 35M
ClueWeb triples 12M 104M
Relations 25,994 23 612
Relation types tested 46 46
Avg. paths/relation 2.3M 3.19M
Avg. training positive/query relation – 6621
Avg. training negative/query relation – 6622
Avg. training facts/relation 6638 –
Avg. positive test instances/relation 3492 3516
Avg. negative test instances/relation 43,160 43 777
Fig. 24. An illustration that relations between an entity pair can be inferred by
considering information available in multiple paths collectively [168].
Table 19
Experimental results of path-based KGC methods on NELL995, FB15k-237, Kinship and Countries datasets. The public performance data in this table comes from
[175]. Best results are in bold.
Model NELL995 FB15k-237 Kinship Countries
MRR Hits@1 Hits@3 MRR Hits@1 Hits@3 MRR Hits@1 Hits@3 MRR Hits@1 Hits@3
PRA [164] 0.696 0.637 0.747 0.412 0.322 0.331 0.799 0.699 0.896 0.739 0.577 0.9
Single-Model [167] 0.859 0.788 0.914 0.575 0.512 0.567 0.804 0.814 0.885 0.941 0.918 0.956
MINERVA [63] 0.879 0.813 0.931 0.615 0.49 0.659 0.824 0.71 0.937 0.96 0.925 0.995
PRANN [175] 0.898 0.838 0.951 0.66 0.544 0.708 0.952 0.918 0.984 0.947 0.916 0.986
Table 20
Link prediction results of path-based KGC methods on the WN18 and FB15k datasets. All the data in the table come from [174]. Best results are in bold.
Model | WN18 Hits@10 | WN18 MR | FB15k Hits@10 | FB15k MR
RTransE [39] | – | – | 0.762 | 50
PTransE (ADD, 2-step) [174] | 0.927 | 221 | 0.834 | 54
PTransE (MUL, 2-step) [174] | 0.909 | 230 | 0.777 | 67
PTransE (ADD, 3-step) [12] | 0.942 | 219 | 0.846 | 58
PTransD (ADD, 2-step) [117] | – | – | 0.925 | 21
RPE (ACOM) [172] | – | – | 0.855 | 41
RPE (MCOM) [172] | – | – | 0.817 | 43
IRN [169] | 0.953 | 249 | 0.927 | 38
OPTransE [174] | 0.957 | 199 | 0.899 | 33

Fig. 25. Example of rules for KGC. The picture refers to [184].
Fig. 26. An example of the robustness of rule reasoning shown in [191].

has shown that the non-linear composition function outperforms linear functions (as used by them) for relation prediction tasks) to select and expand the appropriate linear or non-linear model.

4.2. External extra information outside KGs

In this section we comb through KGC studies which exploit external information, covering two main aspects: rule-based KGC in Section 4.2.1 and third-party data source-auxiliary KGC in Section 4.2.2.

4.2.1. Rule-based KGC
Logical rules in KGs are non-negligible in that they can provide expert and declarative information for KGC; they have been demonstrated to play a pivotal role in inference [185–187], and hence are of critical importance to KGC. In this section we give a systematic introduction to KGC tasks working with various rules; we also list a summary table of rule-based KGC methods in Table 21.

4.2.1.1. Introduction of logical rules. An example of KGC with logical rules is shown in Fig. 25. From a novel perspective [192], KGs can be regarded as a collection of conceptual knowledge, which can be represented as a set of rules like BornIn(x, y) ∧ Country(y, z) → Nationality(x, z), meaning that if person x was born in city y and y lies in country z, then x is a citizen of z. Rules are explicit knowledge (compared to a neural network), so reasonable use of logic rules is of great significance for handling problems in KGs. Rule-based KGC allows knowledge transfer for a specific domain by exploiting rules about the relevant domain of expertise, which makes rule-based reasoning achieve high accuracy. Moreover, logical rules are interpretable enough to provide insight into the results of reasoning, and in many cases this excellent character leads to robustness in KGC transfer tasks. For example, conducting rule reasoning over a growing KG can avoid part of the retraining work caused by the addition of new nodes, which is more adaptable than models built for certain entities within a specific KG. Consider the scenario in Fig. 26: when we add some new facts about more companies or locations to this KG, the rules with respect to 'HasOfficeInCountry' will still be usefully accurate without retraining. The same might not be workable for methods that learn embeddings for specific KG entities, as is done in TransE. In other words, logical rule-based learning can be applied to those ''zero-shot'' entities that cannot be seen during training.
The rules are manually or automatically constructed as various logic formulas, and each formula learns a weight by sampling or counting groundings from existing KGs. These weighted formulas are viewed as the long-range interactions across several relations [185]. Manual rules are not suitable for large-scale KGs; on the other hand, it is hard to cover all rules of a specific domain KG by hand. Recently, rule mining has become a hot research topic, since it can automatically induce logical rules from ground facts, i.e., it captures co-occurrences of frequent patterns in KGs to determine logical rules [207,209] in a machine-readable format.

4.2.1.2. Definition of logical rule-based KGC. Formulaically, the KGC over rules we consider here consists of a query, an entity tail that the query is about, and an entity head that is the answer to the query [191]. The goal is to retrieve a ranked list of entities based on the query such that the desired answer (i.e., the head) is ranked as high as possible.

Formulation of Logical Rules: In terms of first-order logic [210,211], given a logical rule, it is first instantiated with concrete entities in the vocabulary E, resulting in a set of ground rules. Suppose X is a countable set of variables and C is a countable set of constants. A rule is of the form head ← body, as in the following formula, where the head query(Y, X) is an atom over R ∪ X ∪ C and the body R_n(Y, Z_n) ∧ ... ∧ R_1(Z_1, X) is a conjunction of positive or negative atoms over R ∪ X ∪ C:
query(Y, X) ← R_n(Y, Z_n) ∧ ... ∧ R_1(Z_1, X)
where R_1, . . . , R_n are relations in the KGs.

Ground Atom & Rule's Grounding: A triple (e_i, r_k, e_j) can be taken as a ground atom which applies a relation r_k to a pair of entities e_i and e_j. When replacing all variables in a rule with concrete
Table 21
Characteristics of introduced KGC methods using rules.
Model Technology Information Rules Dataset
Markov Logic Network (MLNs) series:
LRNNs [188] Standard feed-forward NN, First-order rules Function-free first-order logic 78 RL benchmarks
weighted first-order rules
MLN-based KGC Markov Logic Network, Rules – –
[189] mathematical axiom proof
ExpressGNN [190] GNNs+MLN, Logic rules, First order logical rules in MLN FB15K-237
solving zero-shot problem, entity information
EM algorithm,
mean-field approximation inference
End-to-end differentiable framework:
NeuralLP [191] TensorLog, First-order logical Weighted chain-like logical rules WN18,
neural controller system (LSTM), rules FB15K,
attention mechanism FB15KSelected
RLvLR [192] Improves NeuralLP, First-order logical CP rule: closed path rules FB75K,
RESCAL, rules WikiData
target oriented sampling
NTPs [193] RNN, First-order logical Function-free first-order logic rules, Countries,
backward chaining algorithm, rules parameterized rules, Kinship,
RBF kernel, unify rule, Nations,
ComplEx OR rule, UMLS
AND rule
NTP2.0 [194] NTPS, First-order logical Function-free first-order logic rules; Countries,
max pooling strategy, rules parameterized rules; Nations,
Hierarchical Navigable Small World (HNSW, a unify rule; Kinship,
ANNS structure) OR rule; UMLS
AND rule
DRUM [184] Open World Assumption, First-order logical – Family,
confidence score, rules UMLS,
BIRNN Kinship
Combining rule and embedding approach:
a. A shallow interaction:
r-KGE [185] ILP, Logical rules, Rule 1 (noisy observation); Location,
RESCAL/TRESCAL/TransE, physical rules Rule 2 (argument type expectation); Sport
four rules Rule 3 (at-most-one restraint);
Rule 4 (simple implication).
INS [195] MLNs, Paths, path rules FB15K
INS-ES, rules
TransE
ProRR-MF [196] ProPPR, First-order logical First-order logical rules FB15K,
matrix factorization, rules WordNet
BPR loss
b. Explore further combination style:
KALE [197] Translation hypothesis, Logic rules Horn logical rules WN18,
t-norm fuzzy logic FB122
Trans-rule [198] TransE/TransH/TransR, First-order logical Inference rules; WN18,
first-order logic space transformer, rules transitivity rules; FB166, FB15K
encode the rules in vector space, antisymmetry rules
confidence score with a threshold
c. Iteration interactions:
RUGE [199] Iterative model, Soft rules, Soft rules; FB15K,
soft label prediction, logic rules Horn logical rules YAGO37
embedding rectification,
confidence score
ItRI [200] KG embedding model, Feedback information Non-monotonic rules with negated FB15K,
iteratively learning, of KG embedding atoms; Wiki44K
pruning strategy, model text corpus, non-monotonic rules with
hybrid rule confidence measures non-monotonic rules partially-grounded atoms
IterE [201] Iterative model, OWL2 Language, 7 types of object property expression; WN18-s,
embedding representation, axioms information ontology axioms; WN18RR-s,
axiom induction, Horn logical rules FB15k-s,
axiom injection, FB15k-237-sa
confidence score,
linear mapping hypothesis
Table 21 (continued).
Model Technology Information Rules Dataset
pLogicNet [202] MLN, First order logical First order logical rules in MLN: FB15k,
EM algorithm, rules Composition Rules, FB15k-237,
amortized mean-field inference, Inverse Rules, WN18,
KG embedding model Symmetric Rules, WN18RR
(TransE/ComplEx) Subrelation Rules
Text + Logic rules:
FEP-AdTE [203] Knowledge verification system, Logical information, First-order logic rules FGCNb KGs
TEP-based abductive text evidence, text information
remote supervision
Rules + Paths + Embedding approaches:
AnyBURL [204] Aleph’s bottom-up rule learning Fuzzy rules, Straight ground path rule: FB15(k),
uncertain rules, AC1 rules, FB15-237,
path AC2 rules, WN18,
C rules WN18RR,
YAGO03-10
ELPKG [205] KGE model, Path information, Probabilistic soft logic rules YAGO,
breadth first search for paths, logic rules NELL,
probabilistic logical framework PSL YAGO-50,
YAGO-rest
RPJE [206] KGE model, Logical rules, Horn rules for two modules: FB15K,
confidence score, path R1 : relation pairs association, FB15K-237,
compositional representation learning R2 : paths composition WN18, NELL-995
Filtering candidate triples:
AMIE+ [207] Open-world assumption, First-order logical Single chain of variable rules YAGO2 core,
pruning operations rules for Confidence approximation; YAGO2s,
PCA; DBpedia 2.0,
typed rules DBpedia 3.8,
Wikidata
CHAI [26] Complex rules normalizer Rules Complex rules base on relation FB13,
domain and distance; WN18,
4 types filtering candidates criteria NELL,
EPSRC
About evaluation:
RuleN [208] An unify evaluation framework, Logical rules Path rules Pn ; WN18,
evaluated with AMIE model C rules FB15k,
FB15k-237
a. '-s' means the '-sparse' series datasets.
b. The 'FGCN' means Four Great Chinese Novels in China.
When replacing all variables in a rule with concrete entities in the KG, we get a grounding of the rule. A logical rule is encoded, for example, in the form of ∀x, y : (x, rs, y) → (x, rt, y), reflecting that any two entities linked by relation rs should also be linked by relation rt [197]. For example, a universally quantified rule ∀x, y : (x, CapitalOf, y) → (x, LocatedIn, y) might be instantiated with the concrete entities Paris and France, forming the ground rule (Paris, CapitalOf, France) → (Paris, LocatedIn, France). A grounding whose triples all exist in the KG is a support of the rule, and a ground rule can then be interpreted as a complex formula constructed by combining ground atoms with logical connectives (e.g. ∧ and →).

Logical Rules for KGC: To reason over KGs, for each query we are usually interested in learning weighted chain-like rules of a form similar to stochastic logic programs [212]:

α : query(Y, X) ← Rn(Y, Zn) ∧ ... ∧ R1(Z1, X)

where α ∈ [0, 1] is the confidence associated with the rule. In a generic sense, the inference procedure defines the score of each y implying query(y, x) as the sum of the confidences of the rules that derive it for the given entity x, and returns a ranked list of entities in which a higher score implies a higher rank [191].

4.2.1.3. Rule mining. Inferring missing facts among the existing entities and relations of a growing KG by rule-based inference has become a hot research topic, and how to learn the rules used for KGC has also attracted much attention. There is a large body of literature on rule learning technology.

(1). Inductive logic programming (ILP) for rule mining:

Inductive logic programming (ILP) [213] (i.e. XAIL) is a type of classical statistical relational learning (SRL) [214]; it proposes new logical rules and is commonly used to mine logical rules from KGs. Although ILP is a mature field, mining logical rules from KGs is difficult because of the open-world assumption KGs abide by, which means that absent information cannot be taken as counterexamples.

(2). Markov Logic Networks (MLNs) and its extensions:

Often the underlying logic is a probabilistic logic, such as Markov Logic Networks (MLNs) [215] or ProPPR [216]. The advantage of using probabilistic logic is that, by equipping logical rules with probability, one can better statistically model complex and noisy data [191].

MLNs combine hard logic rules and probabilistic graphical models. The logic rules incorporate prior knowledge and allow MLNs to generalize in tasks with a small amount of labeled data, while the graphical model formalism provides a principled framework for dealing with uncertainty in the data. However, inference in MLNs is computationally intensive, typically exponential in the number of entities, limiting their real-world application. Also, logic rules can only cover a small part of the possible combinations of KG relations, hence limiting the application of models that are purely based on logic rules.
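The confidence-weighted scoring just described can be sketched in a few lines. The snippet below is a simplified illustration only (toy index and helper names are ours); it uses the sum-of-confidences aggregation mentioned above for [191], while other systems aggregate differently (e.g. [208] takes the maximum).

# Simplified sketch of confidence-weighted rule scoring for a query (h, r, ?).
# Each rule is a relation chain with a confidence; a candidate's score is the
# sum of confidences of the rules that derive it. Toy data, illustrative only.
from collections import defaultdict

def chase_chain(kg_index, start, chain):
    """Follow a chain of relations from `start`; return reachable entities."""
    frontier = {start}
    for rel in chain:
        frontier = {t for e in frontier for t in kg_index[rel].get(e, ())}
    return frontier

def score_candidates(kg_index, head, rules):
    """rules: list of (confidence, [r1, r2, ...]) chains for the target relation."""
    scores = defaultdict(float)
    for confidence, chain in rules:
        for candidate in chase_chain(kg_index, head, chain):
            scores[candidate] += confidence
    return sorted(scores.items(), key=lambda kv: -kv[1])  # ranked list

# kg_index[relation][head] -> list of tails
kg_index = {
    "BornIn":  {"Marie": ["Paris"]},
    "Country": {"Paris": ["France"]},
}
rules_for_nationality = [(0.9, ["BornIn", "Country"])]
print(score_candidates(kg_index, "Marie", rules_for_nationality))
# -> [('France', 0.9)]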
Lifted Relational Neural Networks (LRNNs) [188] is a lifted model in which weighted first-order rules and a set of relational facts together define a standard feed-forward neural network, where the weights of the rules can be learned by stochastic gradient descent and a separate ground neural network is constructed for each example.

A theoretical study of MLN-based KGC (MLN-based KGC) [189] explores the possibility of using MLNs for KGC under maximum likelihood estimation, and theoretically discusses the applicability of learning the weights of an MLN from KGs in the case of missing data. In this work it is proved, by mathematical axiom proof, that the original method is feasible: taking the weights of an MLN learned on a given, incomplete KG as meaningful and correct (i.e. using the so-called closed world assumption) and then applying the learned MLN to the same open KG to infer the missing facts. Based on the assumption that the missing triples are independent and have the same probability, the paper points out that a necessary condition for the original reasoning method is that the distribution learned by the MLN should be as close as possible to the data-generating distribution. In particular, maximizing the log-likelihood of the training data should lead to maximizing the expected log-likelihood of the MLN model.

ExpressGNN [190] explores the combination of MLNs and popular GNNs in the KGC field, and applies GNNs to MLN variational reasoning. It uses GNNs to explicitly capture the structural knowledge encoded in the KG to supplement the knowledge in the logic formulas for prediction tasks. The compact GNN allocates similar embeddings to similar entities in the KG, while the expressive, tunable embeddings provide additional model capacity to encode entity-specific information beyond the graph structure. ExpressGNN overcomes the scalability challenge of MLNs through an efficient stochastic training algorithm, compact posterior parameterization and GNNs. A large number of experiments show that ExpressGNN can effectively carry out probabilistic logic reasoning and make full use of the prior knowledge encoded in logic rules while meeting data-driven requirements. It achieves a good balance between representation ability and model simplicity. In addition, it not only can solve the zero-shot problem, but is also general enough to balance the compactness and expressiveness of the model by adjusting the dimensions of the GNN and the embeddings.

(3). End-to-end differentiable rule-based KGC methods:

Based on these basic rule-mining theories, a large number of end-to-end differentiable rule-based KGC methods have been developed for these types of rules.

Neural Logic Programming (NeuralLP) [191] is an end-to-end differentiable framework which combines first-order rule inference and sparse matrix multiplication, thus allowing the parameters and the structure of logical rules to be learned simultaneously. Additionally, this work establishes a neural controller system that uses an attention mechanism to properly allocate confidences to the logical rules at the semantic level, rather than merely "softly" generating approximate rules as in previous works [217–220]; the main function of the neural controller system is to control the composition of the primitive differentiable operations of TensorLog [221] in the memory of an LSTM so as to learn variable rule lengths.

RLvLR [192] aims at tackling the main challenges in the scalability of rule mining. Learning rules from KGs with the RESCAL embedding technique, RLvLR guides rule mining by exploring the embedding space of predicates and arguments. A new target-oriented sampling method contributes greatly to the scalability of RLvLR when inferring over large KGs, and the assessment of candidate rules is handled by a suite of matrix operations following [207,209]. RLvLR shows good performance in both rule quality and system scalability compared with NeuralLP.

NTPs [193] is similar to NeuralLP in that it also focuses on the fusion of neural networks and rule inference, but it models neural networks following a backward chaining algorithm as in Prolog, performing inference by recursively modeling transitivity relations between facts represented as vectors or tensors using an RNN. NTPs makes full use of the similarity of similar sub-symbolic representations in vector space to prove queries and induce function-free first-order logical rules, and the learned rules are further used to perform KGC. Although NTPs demonstrates better results than ComplEx on a majority of evaluation datasets, it has less scalability than NeuralLP owing to the computational complexity of considering all proof paths for each given query.

NTP 2.0 [194] scales up NTPs to deal with real-world datasets that could not be handled before. After constructing the computation graph in the same way as NTPs, NTP 2.0 employs a pooling strategy to concentrate only on the most promising proof paths, reducing the solution search procedure to an Approximate Nearest Neighbor Search (ANNS) problem using Hierarchical Navigable Small World (HNSW) graphs [222,223].

DRUM [184], an extensible and differentiable first-order logic rule mining algorithm, further improves NeuralLP by learning both the rule structure and the confidence score corresponding to each rule; it establishes a connection between each rule and its confidence score learned by tensor approximation, and uses a BiRNN to share useful information when learning rules. Although it makes up for the shortcomings of previous inductive LP methods, which have poor interpretability and cannot infer unknown entities, DRUM is still developed under the Open World Assumption of KGs and is limited to positive examples in training. In follow-up research, it is necessary to further explore improved DRUM variants suitable for negative sampling, or to explore the combination of representation learning and differentiable rule mining as in [63,224].

4.2.1.4. Combining rule-based KGC models with KGE models. Rule-based KGC models provide interpretable reasoning and allow domain-specific knowledge transfer by using rules about the related professional fields. Compared to representation models, rule-based models do not need a lot of high-quality data yet can achieve high accuracy and strong interpretability. However, they often face efficiency problems in large-scale search spaces, while the embedding-based KGC models, i.e., the KGE models, have higher scalability and efficiency but are weak at dealing with sparse data owing to their great dependence on data. We summarize the advantages and disadvantages of rule-based and embedding-based KGC methods in a simplified table (Table 22). Therefore, there is no doubt that combining rule-based reasoning with KGE models to conduct KGC is noteworthy. See Fig. 27 for a rough overview of the research combining rule information with KGE models.

(1). A shallow interaction:

There are already some simple integration works among the earlier attempts:

r-KGE [185] is one of these methods; it tries to utilize ILP to integrate embedding models (three embedding models: RESCAL, TRESCAL, TransE) and four rules (including logical rules and physical rules): rules are expressed as constraints of a maximization problem, by which the size of the embedding space is greatly reduced. r-KGE employs relaxation variables to model the noise explicitly, and a simple noise reduction method is used to reduce the noise in KGs. But there are some disadvantages in this work: it cannot handle n-to-n relations, and the reasoning process is too time-consuming, especially for large KGs, which gives the algorithm poor scalability.
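The TensorLog-style operator composition that NeuralLP builds on, described above, can be sketched as follows. This is a highly simplified illustration under toy assumptions (tiny matrices, hand-fixed attention); it is not the actual NeuralLP implementation.

# Each relation is a (binary) adjacency matrix; applying a chain rule is a
# sequence of matrix-vector products, and soft attention over relations makes
# the rule structure learnable. Toy sizes; our own illustration.
import numpy as np

n_entities, n_relations, rule_length = 4, 3, 2

# M[r] is the adjacency matrix of relation r: M[r][t, h] = 1 iff (h, r, t) holds.
M = np.zeros((n_relations, n_entities, n_entities))
M[0, 1, 0] = 1.0   # relation 0 links entity 0 -> entity 1
M[1, 2, 1] = 1.0   # relation 1 links entity 1 -> entity 2

# Attention over relations at each step of the chain (would be produced by the
# neural controller in NeuralLP; here fixed by hand).
attention = np.array([[0.9, 0.05, 0.05],    # step 1: mostly relation 0
                      [0.05, 0.9, 0.05]])   # step 2: mostly relation 1

def apply_soft_rule(head_entity):
    """Soft traversal: start from a one-hot entity vector and compose operators."""
    v = np.zeros(n_entities)
    v[head_entity] = 1.0
    for step in range(rule_length):
        # Attention-weighted mixture of relation operators, then one hop.
        mixed_operator = np.tensordot(attention[step], M, axes=1)
        v = mixed_operator @ v
    return v  # soft membership scores over candidate tail entities

print(apply_soft_rule(0))   # entity 2 receives the highest score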
Table 22
Statistics of pros and cons of rule-based KGC methods and embedding-based KGC methods.
Category Advantage Disadvantage
Rule-based KGC 1. Consider explicit logical semantics 1. Poor Scalability
2. Strong explainability and accuracy 2. Noise sensitive
3. Low data dependency 3. High computational complexity
4. Can be applied to both transductive and inductive problems
5. High robustness avoiding re-training
Embedding-based KGC 1. High scalability 1. Data-driven
2. High efficiency 2. Poor explainability
3. Not affected by huge candidate sets 3. Hard to model the interaction of different relations
4. Cannot handle inductive scenarios
Fig. 27. Several typical KGC models which combine logical rules and embedding models, the development from (a) to (d) shows the process of deepening interaction
between rules and embedding models.
Source: These pictures are extracted from [52,185,197,199]
INS [195] is a data-driven inference method which naturally incorporates logic rules and TransE together through MLNs to conduct KGC, where TransE calculates a similarity score between each candidate and the correct tag so as to select the top-N instances and form a smaller new candidate set, which not only filters out useless noisy candidates but also improves the efficiency of the reasoning algorithm. The calculated similarity score is used as prior knowledge to promote further reasoning. For the selected candidates, INS and its improved version INS-ES [195] adopt an MLN that considers the transition probability between network sampling states during reasoning, so the whole reasoning process becomes supervised. It is worth noting that INS greatly improves the Hits@1 score on the FB15K dataset.
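A sketch of this shallow rule–embedding interaction (an embedding score prunes the candidate set, and the more expensive rule-based reasoning only re-ranks the survivors) is given below. It is our own toy illustration of the idea, not the actual INS implementation; vectors, names and the rule scorer are hypothetical.

# Step 1: keep only the top-N candidates by a TransE-style score.
# Step 2: let the rule-based scorer re-rank the survivors.
import numpy as np

def transe_score(h, r, t):
    """Higher is better: negative translation error -||h + r - t||."""
    return -np.linalg.norm(h + r - t)

def filter_then_rerank(h_vec, r_vec, candidates, rule_scorer, top_n=2):
    scored = sorted(candidates.items(),
                    key=lambda kv: transe_score(h_vec, r_vec, kv[1]),
                    reverse=True)[:top_n]
    return sorted(((name, rule_scorer(name)) for name, _ in scored),
                  key=lambda kv: -kv[1])

h = np.array([0.1, 0.2]); r = np.array([0.3, 0.0])
candidates = {"France": np.array([0.4, 0.2]),
              "Spain":  np.array([0.9, 0.9]),
              "Japan":  np.array([0.35, 0.25])}
rule_confidence = {"France": 0.9, "Spain": 0.1, "Japan": 0.4}
print(filter_then_rerank(h, r, candidates, rule_confidence.get))
# -> [('France', 0.9), ('Japan', 0.4)]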
A Matrix Factorization Based Algorithm utilizing ProPPR (ProRR-MF) [196] tries to construct continuous low-dimensional embedding representations for first-order logic from scratch and is interested in learning potential, distributed representations of Horn clauses. It uses scalable probabilistic logic structure learning (ProPPR [216]) to construct expressive and learnable logic formulas from large, noisy real-world KGs, and applies a matrix factorization method to learn formula embeddings. This work is the first formal study of low-dimensional embedding learning for first-order logic rules. However, it is still in a dilemma when predicting new knowledge, since it does not combine entity, relation and rule embeddings to make symbolic reasoning cooperate with statistical reasoning.

Nevertheless, although these several KGC methods jointly model logical rules and embeddings, the rules involved are used merely as post-processing of the embedding methods, which leads to little improvement in the generation of better embedding representations [197].

(2). Explore further combination style:

Different from the previous approaches, the later literature expects to explore more meaningful combination styles, rather than merely working jointly at the surface level.
KALE [197] is a very simple KGC model which combines the embedding model with logical rules but pays attention to the deep interaction between rules and embedding methods. The main idea of KALE is to represent triples and rules in a unified framework, in which triples are represented by atomic formulas and modeled by the translation hypothesis, while rules are represented by complex formulas and modeled by t-norm fuzzy logic. Embedding can then minimize the overall loss over atomic formulas and complex formulas. In particular, it enhances the ability to predict new facts that cannot be inferred directly by pure logical inference, and it has strong generality with respect to rules.

Trans-rule [198] is also a translation-based KG embedding and logic rules associative model; what distinguishes it from previous similar works is that it concerns rules whose confidence is above a threshold, including inference rules, transitivity rules and antisymmetry rules. These rules and their confidences are automatically mined from the triples in the KG and are then placed together with the triples into a unified first-order logic space in which the rules can be encoded. Additionally, to avoid the problem of inconsistent algebraic operations, it maps all triples into first-order logic and defines several interaction operations for rules to keep the rule encoding a 1-to-1 mapping relation.
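To illustrate how a rule can receive a soft truth value from triple scores, the sketch below composes per-atom scores with a simple t-norm-style conjunction and implication. It is only a schematic illustration of the idea behind KALE-style models; the exact composition functions in [197] differ, and the Łukasiewicz operators used here are an assumption made for the example.

# Schematic t-norm-style soft truth of a ground Horn rule (a ^ b) -> c,
# given soft truth values (e.g. normalized triple scores) for the ground atoms.
def t_and(x: float, y: float) -> float:
    """Lukasiewicz conjunction: truth of (x AND y)."""
    return max(0.0, x + y - 1.0)

def t_implies(x: float, y: float) -> float:
    """Lukasiewicz implication: truth of (x -> y)."""
    return min(1.0, 1.0 - x + y)

def rule_truth(body_scores, head_score):
    """Soft truth of body_1 ^ ... ^ body_n -> head."""
    body = body_scores[0]
    for s in body_scores[1:]:
        body = t_and(body, s)
    return t_implies(body, head_score)

# Ground rule: (Paris, CapitalOf, France) ^ (x, BornIn, Paris) -> (x, Nationality, France)
print(rule_truth([0.95, 0.90], 0.80))  # strong body, so the head score matters
print(rule_truth([0.95, 0.20], 0.10))  # weak body, rule is (vacuously) nearly satisfied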
(3). Iteration interactions:

With the emergence of new completion demands, a new way of jointly learning rules and embeddings for KGC in an iterative manner has come into being.

RUGE [199] is a novel KG embedding paradigm which combines the embedding model with logic rules and exploits guidance from soft rules in an iterative way. RUGE enables the embedding model to learn from both labeled and unlabeled triples in the existing KG, while soft rules with different confidence levels are acquired automatically from the KG at the same time. Differing from previous studies, this work is the first to apply an iterative manner to deeply capture the interactive nature between embedding learning and logical inference. The iterative procedure can automatically extract beneficial soft rules without the extensive manual effort required by conventional attempts, which always use hard rules in a one-time injection manner. Each iteration contains two stages, soft label prediction and embedding rectification, which are responsible for approximate reasoning and for updating the KG with the newly predicted triples to obtain better embeddings in the next iteration, respectively. Through the whole iterative procedure, this flexible approach can fully transfer the rich knowledge contained in logic rules into the learned embeddings. Moreover, RUGE demonstrates the usefulness of automatically extracted soft rules through a series of experiments.

Iterative Rule Induction (ItRI) [200] iteratively extends induced rules guided by feedback information from a pre-computed KG embedding model (including probabilistic representations of missing facts) as well as external information sources such as text corpora; the devised approach thus not only learns high-quality rules but also avoids scalability problems. Moreover, this machinery is more expressive through supporting non-monotonic rules with negated atoms and with partially grounded atoms.

IterE [201] recursively combines the embedding model and rules to learn embedding representations as well as logic rules. IterE mainly consists of three parts: embedding representation, axiom induction and axiom injection, and the training is carried out by interactive iteration among these three parts so that rules and embeddings can promote each other to the greatest extent, forming the final reasoning framework. Specifically, on the one hand, the embedding model learns from the existing triples in the KG as well as the triples inferred from the rules. On the other hand, the confidence scores of the axioms derived by the pruning strategy are calculated on the learned relational embeddings according to the linear mapping hypothesis, and new triples can then be inferred by the axioms. Finally, the new triples are linked into the KG for subsequent entity embedding learning. The recursive operation designed by IterE not only alleviates the sparsity of KGs but also pays attention to the influence of semantics on rules. IterE proposes a new form of combining rules and embedding representations, which provides a new idea for KGC research combining different types of methods.

pLogicNet [202] is the product of cooperation between a KG embedding model and MLN logic rules. Similar to IterE, pLogicNet also operates under a deep interaction between embeddings and rules. The difference is that in pLogicNet a first-order Markov logic network is used to define the joint distribution of all possible triples, and a variant of the EM algorithm is then applied to optimize pLogicNet. In the E-step of the variant EM algorithm, the probability of unobserved triples is inferred using amortized mean-field inference, and the variational distribution is parameterized by the parameters of the KG embedding model; in the M-step, the weights of the logic rules are updated by defining a pseudo-likelihood on both the observed triples and the triples inferred by the embedding model. pLogicNet can be trained effectively with stochastic gradient descent. The training process iteratively performs the E-step and M-step until convergence, and the convergence speed of the algorithm is very satisfactory.

4.2.1.5. Cooperating rules with other information. (1) Cooperating with abductive text evidence:

TEP-based Abductive Text Evidence for KGC (FEP-AdTE) [203] combines logical information and text information to form a new knowledge verification system, adding new fact triples to KGs. The main idea of this paper is to define the explanation of triples — abductive text evidence of the form (triple, window) based on TEP — in which the sentence window w explains the degree of existence of the triple τ, and to use the remote supervision method from relation extraction to estimate the abductive text evidence. FEP-AdTE considers only the subset-minimal abductive explanation (called the MinA explanation) to make the explanation as concise as possible, and applies a hypothesis constraint to limit the number of MinA explanations to be calculated, making the interpretation work feasible. It is worth mentioning that this paper develops KGs corresponding to the text corpora of four Chinese classics to evaluate its new knowledge verification mechanism. However, the triple interpretation in this paper does not contain valuable entity-type attributes. In future work, one could consider adding pragmatic interpretation of entity types to further enhance the verification effect of new knowledge and contribute to KGC.

(2) Cooperating with path evidence:

AnyBURL [204] can learn logic rules from large KGs in a bottom-up manner at any time. AnyBURL is designed as an effective KG rule miner; its concept of an example is based on the interpretation of paths in KGs, which indicates that a KG can be formed into a group of paths with edge marks. In addition, AnyBURL learns fuzzy, uncertain rules. Because the candidate ranking can be explained by the rules that generate the ranking, AnyBURL has good explanatory power. In addition to the other advantages of rule-based KGC, the additional advantages of AnyBURL are its fast running speed and low resource usage. Moreover, AnyBURL proves that rule learning can be effectively applied to larger KBs, which overturns the previous bias against rule-based KGC methods.

ELPKG [205] combines path information, embedding representations and soft probabilistic logic rules together. In a word, the KG embedding model is used to train the representation of inter-entity relations, and breadth-first search is used to find the paths between entity nodes. The representation of entities/relations based on path information is combined with the representation based
Table 24
A KGC example using rules, referred to [208]. In this instance, four relevant rules for the completion task (h, r, ?) result in the ranking (g(0.81), d(0.81), e(0.23), f(0.23), c(0.15)). A rule can generate one candidate (fourth row), several candidates (first and third row), or no candidate (second row).
Rule Type Confidence Result
r(x, y) ⇐ s(y, x) P1 0.81 {d, g}
r(x, y) ⇐ r(y, x) P1 0.7 ∅
r(x, y) ⇐ t(x, z) ∧ u(z, y) P2 0.23 {e, f, g}
r(x, c) ⇐ ∃y r(x, y) C 0.15 {c}

Table 25
Statistics about other datasets for KGC using rules.
Dataset Entity Relation Fact (#Train #Valid #Test)
NELL-995 [206] 75,492 200 123,370 15,000 15,838
DRC [203] 388 45 333 – 34 530
JW [203] 104 21 106 – 27 670
OM [203] 156 38 178 – 34 010
RTK [203] 123 30 132 – 29 817
FB122 [197] 9738 122 91,638 9595 5057+6186
FB166 [198] 9658 166 100,289 10,457 12,327
YAGO [205] 192 628 51 192 900
NELL [205] 2 156 462 50 2 465 372
YAGO-50 [205] 192 628 50 100 774
YAGO-rest [205] 192 628 41 92 126
Sport [185] 447 5 710
Location [185] 195 5 231
Countries [225] 244+23 5 1158
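A small sketch of the aggregation scheme from [208] illustrated in Table 24 (a candidate's final score is the maximum confidence of any rule that generated it, and the number of firing rules breaks ties) is given below; the toy data reproduce the table's example, and the helper names are ours.

# Aggregating rule predictions for (h, r, ?) as in [208].
def rank_candidates(rule_results):
    """rule_results: list of (confidence, set_of_candidates) pairs."""
    best, count = {}, {}
    for confidence, candidates in rule_results:
        for c in candidates:
            best[c] = max(best.get(c, 0.0), confidence)
            count[c] = count.get(c, 0) + 1
    # Sort by (max confidence, number of generating rules), both descending.
    return sorted(best, key=lambda c: (best[c], count[c]), reverse=True)

rules = [(0.81, {"d", "g"}),
         (0.70, set()),
         (0.23, {"e", "f", "g"}),
         (0.15, {"c"})]
print(rank_candidates(rules))   # g first (0.81, generated by two rules), then d, ...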
Besides, the work in [184] uses two theorems to learn rule structures and appropriate scores simultaneously. However, this is challenging because the method needs to find an optimal structure in a large discrete space while simultaneously learning proper score values in a continuous space. Because evaluating candidate rules in a rule mining system is generally challenging and time-consuming, [192] reduces this computation to a series of matrix operations. This efficient rule evaluation mechanism allows the rule mining system to handle massive benchmarks efficiently. Meilicke et al. [208] present a unified fine-grained evaluation framework that assesses rule-based inference models on the datasets generally used for embedding-based models, making an effort to observe valuable rules and interesting experiences for KGC. Rule confidence needs to be considered as well: when we use the relevant rules for the completion task (h, r, ?), a rule can generate a variable number of candidates, and there are various possible ways of aggregating the results generated by the rules. The work in [208] defines the final score of an entity as the maximum of the confidence scores of all rules that generated this entity. Furthermore, if a candidate has been generated by more than one rule, the number of such rules is used as a secondary sorting attribute among candidates with the same (maximum) score. For instance, in Table 24 there are four relevant rules for completing (h, r, ?), resulting in the final ranking (g(0.81), d(0.81), e(0.23), f(0.23), c(0.15)). To support the evaluation system, this paper designs a simplified rule-based model called RuleN for the experimental assessment and evaluates it together with the AMIE model. With inspiring experimental results showing that models integrating multiple different types of KGC approaches deserve attention in the KGC task, the paper further classifies the test cases of the datasets for fine-grained evaluation according to the interpretations generated by the rule-based method, and then obtains a series of observations about the partitioning of test cases in the datasets.

Datasets: Table 25 lists basic statistics of commonly used datasets in rule-based KGC research. Here we introduce several datasets in detail.

NELL: the NELL datasets (http://rtw.ml.cmu.edu/rtw/resources) and their subsets are often used as experimental data, including NELL-995 [206], Location and Sport [185].

FB122: composed of 122 Freebase relations [197] regarding the topics of "people", "location", and "sports", extracted from FB15K. FB122's test set is further split into two parts, test-I and test-II, where the former contains triples that cannot be directly inferred by pure logical inference, and the latter contains the remaining test triples.

Countries: a dataset introduced by [225] for testing the reasoning capabilities of neural link prediction models [193]. Triples in Countries are (countries(c), regions(r), subregions(sr)) and they are divided into train, dev and test sets which contain 204, 20 and 20 countries, respectively.

KGs about the Four great classical masterpieces of Chinese literature (FGCN): new KGs and the corresponding logical theories are constructed from existing text corpora in a domain about character relationships in the four great classical masterpieces of Chinese literature, namely Dream of the Red Chamber (DRC), Journey to the West (JW), Outlaws of the Marsh (OM), and Romance of the Three Kingdoms (RTK) [203]. Triples in those KGs are collected from character relationships in the e-books of these masterpieces, yielding four KGs, each of which corresponds to one masterpiece.

4.2.1.8. Analysis of rule-based KGC methods. In summary, we analyze some tips about experimenting with rule-based KGC methods on the common benchmarks, referring to the results in [208], which allow a more comprehensive comparison between various rule-based methods and embedding-based approaches for KGC by employing a global measure to rank the different methods. On this basis, we gained several interesting insights:

1. Both AMIE and RuleN perform competitively with embedding-based approaches on the most common benchmarks. This holds for the large majority of models reported in [226]; only a few of these embedding models perform slightly better.

2. Since rule-based approaches can deliver an explanation for the resulting ranking, this characteristic can be helpful for conducting fine-grained evaluations and for understanding the regularities within a dataset and its hardness [204].

3. Traditional embedding-based KGC methods may have trouble solving specific types of completion tasks that can be solved easily with rule-based approaches; this point becomes even more important when looking solely at the top candidate of the filtered ranking.

4. One reason for the good results of rule-based systems is the fact that most standard datasets are dominated by rules such as symmetry and (inverse) equivalence (except for specially constructed datasets, e.g., FB15k-237).

5. It is quite possible to leverage both families of approaches by learning an ensemble [185,195–199,202] to achieve better results than any of its members. The overall ensemble models tend to contain a closed-loop operation, which indicates that the embedding representation and the rules reinforce each other. In the future, it is necessary to explore more effective ways of interaction for integrating these two categories of approaches.

6. Recently, novel, effective but complex KG encoding models have been emerging endlessly, which also provides alternative techniques for combining knowledge embedding and rules for KGC in the future.

4.2.2. Third-party data sources-based KGC

Some related techniques learn entity/relation embeddings from the triples in a KG jointly with third-party data sources, in particular with additional textual corpora (e.g., Wikipedia articles), to get help from the related rich semantic information.
Table 26
Statistics of popular KGC models using third-party data sources.
Models Technology Information (Data Source) Datasets
Joint alignment model:
JointAS [15] TransE, skip-gram, Structural information, Freebase subset;
words co-occurrence, entities names, English Wikipedia
entity-words co-occurrence Wikipedia anchors
JointTS [121] TransE, skip-gram, Structural information, FB15K,
JointAS entities names Freebase subset;
Wikipedia text descriptions, Wikipedia articles
textual corpus
DKRL [94] TransE, CBOW, CNN, Structural information, FB15K,
max/mean-pooling multi-hop path, FB20K
entity descriptions
SSP [227] TransE, topic extraction, Structural information, FB15K;
Semantic hyperplane Projection entity descriptions Wikipedia corpuses
Prob-TransE (or TransE/TransD, CNN, Structural information, FB15K;
TransD) semantics-based attention mechanism entity descriptions, NYT-FB15K
JointE (or JointD) anchor text, textual corpus
[228]
JOINER [229] TransE, Structural information, Freebase subset;
regularization, textual corpus, English Wikipedia
JointAS Wikipedia anchors
ATE [230] TransE, BiLSTM, Skip-Gram, Relation mentions and entity descriptions, Freebase, WordNet;
mutual attention mechanism textual corpus English Wikipedia (Wiki)
aJOINT [162] TransE, KG structural information, WN11, WN18, FB13, FB15k;
collaborative attention mechanism textual corpus Wikipedia articles
KGC with Pre-trained Language Models (PLMs):
JointAS [15], DESP word2vec Structural information, textual information FB15K, FB20K
[121], DKRL [94]
LRAE [231] TransE, PCA, word2vec Structural information, FB15k,
entity descriptions WordNet
RLKB [232] Probabilistic model, Structural information, FB500K, EN15K
single-layer NN entity descriptions
Jointly-Model TransE, CBOW/LSTM, Attention, Structural information, FB15K,
[233] Gate Strategy entity descriptions WN18
KGloVe-literals Entity recognition, Textual information in properties, Cities, the AAUP, the Forbes,
[234] KGloVe textual corpus the Metacritic Movies,
the Metacritic Albums;
DBpedia abstracts
Context Graph Context graph, CBOW, Skip-Gram Analogy structure, DBpedia
Model [235] semantic regularities
KG-BERT [236] BERT, sequence classification Entity descriptions, WN11, FB13, FB15K,
entity/relation names, WN18RR, FB15k-237, UMLS;
sequence order in triples, textual corpus Wikipedia corpuses
KEPLER [237] RoBERTa [238], masked language modeling KG structural information, FB15K, WN18, FB15K-237,
(MLM) entity descriptions, WN18RR; Wikidata5M
textual corpus
BLP [239] BERT, holistic evaluation framework, inductive KG structural information, FB15K-237, WN18RR;
LP, TransE, entity descriptions, Wikidata5M
DistMult, ComplEx, and SimplE textual corpus
StAR [240] RoBERTa/BERT, KG structural information, WN18RR, FB15k-237,
multi-layer perceptron (MLP), entity descriptions, ULMS, NELL-One; Wikipedia
Siamese-style textual encoder textual corpus paragraph
Next, we systematically introduce KGC studies that use third-party data sources; we also list them in Table 26 for a direct presentation.

4.2.2.1. Research inspiration. This direction is inspired by three key observations. Firstly, pre-trained language models (PLMs) such as Word2Vec [75], ELMo [241], GPT [242], and BERT [243] have caused an upsurge in the field of natural language processing (NLP), as they can effectively capture the semantic information in text. They originated in the surprising finding that word representations learned from a large training corpus display semantic regularities in the form of linear vector translations [75], for example, king − man + woman ≈ queen. Such a structure is appealing because it provides an interpretation of the distributional vector space through lexical-semantic analogical inferences. Secondly, under the Open-world Assumption, a missing fact often contains entities outside the KG, e.g., one or more entities are phrases appearing in web text but not yet included in the KG [15]. While it is hard to model this scenario by relying only on the inner structural information, third-party textual datasets can provide satisfactory assistance for dealing with these out-of-KG facts. Thirdly, similar to the last point, auxiliary textual information such as entity descriptions can help to learn sparse entities, acting as supplementary information for entities that lack sufficient information in the KG to support learning.
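The analogical regularity mentioned above (king − man + woman ≈ queen) can be checked by a nearest-neighbour search in the word vector space. The sketch below uses hand-made 2-d toy vectors purely for illustration; real word vectors would come from a pre-trained model such as Word2Vec or GloVe.

# Answer an analogy "a is to b as c is to ?" via nearest neighbour of b - a + c.
import numpy as np

vectors = {
    "king":  np.array([0.8, 0.9]),
    "queen": np.array([0.2, 0.9]),
    "man":   np.array([0.7, 0.1]),
    "woman": np.array([0.1, 0.1]),
}

def analogy(a, b, c, vocab):
    """Return the word whose vector is closest to vocab[b] - vocab[a] + vocab[c]."""
    target = vocab[b] - vocab[a] + vocab[c]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(vocab[w] - target))

print(analogy("man", "king", "woman", vectors))   # -> 'queen'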
The most striking type of textual information is the entity description; only a few KGs, such as WordNet and Freebase, contain a readily available short description or definition for each entity or phrase, and usually additional lexical resources are needed to provide textual training data. For instance, for a medical dataset with many technical words, Wikipedia pages, dictionary definitions, or medical descriptions from a site such as 'medilexicon.com' could be leveraged as lexical resources [236].

4.2.2.2. Joint alignment model. JointAS [15] jointly embeds entities and words into the same continuous vector space. Entity names and Wikipedia anchors are utilized to align the embeddings of entities and words in the same space. Large-scale experiments on Freebase and a Wikipedia/NY Times corpus show that joint embedding brings a promising improvement in the accuracy of predicting facts, compared to embedding KGs and text separately. Particularly, JointAS enables the prediction of facts containing entities outside the KG, which cannot be handled by previous embedding methods. The model is composed of three components: the knowledge model LK, the text model LT, and the alignment model LA which makes use of entity names (LAN) and Wikipedia anchors (LAA); the overall objective is to maximize the joint likelihood

L = LK + LT + LA

where LA can be LAA or LAN or LAN + LAA. The score function s(w, v) = b − ½∥w − v∥² is used in the text model for a target word w appearing close to a context word v (within a context window of a certain length), while the score function s(h, r, t) = b − ½∥vh + vr − vt∥² is used in the KG model, in which b is a bias constant.

Although this alignment model goes beyond previous KGE methods and can perform prediction on any candidate facts between entities/words/phrases, it has drawbacks: using entity names severely pollutes the embeddings of words, and using Wikipedia anchors completely relies on this special data source, and hence the approach cannot be applied to other customer data.
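The two score functions given above share one vector space, so words and entities become directly comparable. The sketch below restates them in code; the toy vectors and the bias value are illustrative only, and this is a schematic of the scoring idea rather than the full joint training objective.

# Text-model score and TransE-style KG-model score in one shared space.
import numpy as np

b = 7.0  # bias constant from the scoring functions

def text_score(w, v):
    """s(w, v) = b - 1/2 * ||w - v||^2 for a word w near a context word v."""
    return b - 0.5 * np.sum((w - v) ** 2)

def kg_score(h, r, t):
    """s(h, r, t) = b - 1/2 * ||h + r - t||^2, a translation-based triple score."""
    return b - 0.5 * np.sum((h + r - t) ** 2)

emb = {  # entities and words live in the same space
    "Paris":     np.array([0.4, 0.2]),
    "France":    np.array([0.7, 0.2]),
    "CapitalOf": np.array([0.3, 0.0]),
    "city":      np.array([0.5, 0.25]),   # a plain word, not a KG entity
}

print(kg_score(emb["Paris"], emb["CapitalOf"], emb["France"]))  # high: triple fits
print(text_score(emb["Paris"], emb["city"]))                    # word-entity affinity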
JointTS [121] takes the above-mentioned issues into consideration. Without depending on anchors, it improves the alignment model LA based on the text descriptions of entities by considering both the conditional probability of predicting a word w given an entity e and that of predicting an entity e given a word w. This model learns the embedding vector of an entity not only to fit the structural constraints in the KG but also to be equal to the embedding vector computed from its text description, so it can deal with words/phrases beyond the entities in the KG. Furthermore, the new alignment model relies only on the descriptions of entities, so it can obtain rich information from the text descriptions and thus handles the KG sparsity issue well.

DKRL [94] is the first work to build entity vectors by directly applying entity description information. The model combines triple information with entity description information to learn a vector for each entity. It efficiently learns the semantic embeddings of entities and relations relying on CBOW and CNN mechanisms, and encodes the original structural information of triples with TransE. Experiments on both KGC and entity classification tasks verify the validity of DKRL in representing new entities and dealing with zero-shot cases. But it should not be underestimated that tuning DKRL needs more hyper-parameters, along with extra storage space for the parameters of its inner layers.

Semantic Space Projection (SSP) [227] is a method for KGE with text descriptions that modifies TransH. SSP jointly learns from symbolic triples and textual descriptions, building interaction between these two information sources; at the same time, the textual descriptions are employed to discover semantic relevance and offer precise semantic embeddings. This paper is firmly convinced that triple embedding is always the main procedure and that textual descriptions must interact with triples for better embedding. SSP can model the strong correlations between symbolic triples and textual descriptions by performing the embedding process in a semantic subspace.

Prob-TransE and Prob-TransD [228] jointly learn the representations of entities, relations, and words within a unified parameter-sharing semantic space. The KG embedding process incorporates TransE and TransD (called Prob-TransE and Prob-TransD) as representatives in the framework to handle the representation learning of KGs, while the representation learning stage for textual relations applies a CNN to embed textual relations. A reciprocal attention mechanism consisting of knowledge-based attention and semantic attention (SATT) is proposed to enhance KGC. The attention mechanism can be simply described as follows: during the KG embedding process, semantic information extracted from text models can be used to help explicit relations fit more reasonable entity pairs; similarly, additional logical knowledge information can be utilized to enhance sentence embeddings and reduce the disadvantageous influence of the noise generated in the process of distant supervision. The experiments use anchor text annotated in articles to align the entities in the KG with entity mentions in the vocabulary of the text corpus, and build the alignment between relations in the KG and the text corpus with the idea of distant supervision. A series of comparative experiments proves that the joint models (JointE+SATT and JointD+SATT) perform effectively even when trained without a strictly aligned text corpus. In addition, this framework is adaptable, flexible and open to existing models; for example, the TransE and TransD parts can be replaced by other similar KG embedding methods such as TransH and TransR.

JOINER [229] jointly learns text and KG embeddings via regularization. Preserving word–word co-occurrence in a text corpus and transition relations between entities in a KG, JOINER can also use regularization to flexibly control the amount of information shared between the two data sources in the embedding learning process, with significantly less computational overhead.

ATE [230] carries out KGE using both specific relation mentions and entity descriptions encoded with a BiLSTM module. A mutual attention mechanism between relation mentions and entity descriptions is designed to learn more accurate text representations, so as to further improve the representation of the KG. In the end, the final entity and relation vectors are obtained by combining the learned text representation with the previous traditional translation-based representation. This paper also considers the fuzziness of entities and relations in the triple, and filters out noisy text information to enrich the KG embedding accurately.

aJOINT [162] proposes a new cooperative attention mechanism, based on which a text-enhanced KGE model is proposed. Specifically, aJOINT enhances KG embeddings through the text semantic signal: the multi-directional signals between KGE and text representation learning are fully integrated to learn more accurate text representations, so as to further improve the structure representation.

4.2.2.3. KGC with pre-trained language models. Recently, pre-trained language models (PLMs) such as ELMo [241], Word2Vec [75], GPT [242], BERT [243], and XLNet [244] have shown great success in the NLP field; they can learn contextualized word embeddings from large amounts of free text data and achieve excellent performance in many language understanding tasks [236].

According to the probable usage of PLMs in KGC tasks, the related approaches can be roughly divided into two categories [236]: feature-based and fine-tuning approaches. Traditional feature-based word embedding methods like Word2Vec and
GloVe [92] aim to learn context-independent word vectors. ELMo generalizes traditional word embedding to context-aware word embedding, where word polysemy can be properly handled. Mostly, the word embeddings learned by these models are used as initialization vectors in the KGC process. Different from the former category, fine-tuning approaches such as GPT and BERT use the pre-trained model structure and parameters as the starting point of specific tasks (the KGC task we care about). The pre-trained model learns rich semantic patterns from free text.

Lexical Resources Auxiliary Embedding Model (LRAE) [231] explores methods to provide vector initialization for TransE by using the semantic information of entity description text. LRAE exploits the entity descriptions that are available in the WordNet and Freebase datasets. The first sentence of a given entity description is selected and decomposed into a series of word vectors (the first sentence is often the most relevant to the described entity, which avoids noise interference and the large-scale computation required by a lengthy description text); all those vectors are then averaged to form an embedding that represents the overall description semantics of the entity, where the word vectors are computed by Word2Vec [75] and GloVe [92]. These processed descriptive text vectors are used as the initialization vectors of the translation model and are input to TransE for training. LRAE provides initialization vectors for all entities, even including those not present in the data, thus alleviating the entity sparsity issue. Also, LRAE is very versatile and can be applied directly to other models whose input is represented by dense vectors.

RLKB [232] modifies DKRL by developing a single-layer probabilistic model that requires fewer parameters; it measures the probability of each triple and the corresponding entity description, and obtains contextual embeddings of entities, relations, and the words in the descriptions at the same time by maximizing a log-likelihood loss.

Jointly-Model [233] proposes a novel deep architecture to utilize both the structural and textual information of entities, which contains three neural models to encode the valuable information from an entity's text description: a Bag-of-Words encoder, an LSTM encoder and an attentive LSTM encoder, among which the attentive model can select related information as needed, because some of the words in an entity's description may be useful for the given relation but useless for other relations. The Jointly-Model chooses a gating mechanism to integrate the representations of structure and text into a unified architecture.

Including Text Literals in KGloVe (KGloVe-literals) [234] combines the text information in entity attributes into KG embeddings, which is a preliminary exploratory experiment based on KGloVe: it first performs the KGloVe step to create a graph co-occurrence matrix by conducting personalized PageRank (PPR) on the (weighted) graph; at the same time, it extracts information from the DBpedia abstracts by performing a Named Entity Recognition (NER) step, in which the words representing an entity are replaced by the entity itself and the words surrounding it (and possibly other entities) are contained in the context of the entity; the text co-occurrence matrix is then generated in collaboration with the list of entities and predicates produced in the KGloVe step. Finally, a merge operation is performed to combine the two co-occurrence matrices and fuse the text information into the latent feature model. Although the gain of this work is very small, it can provide new ideas for the joint learning of attribute text information and KG embeddings.

Context Graph Model [235] finds hidden triples by using the observed triples in incomplete graphs. This paper is based on the neural language embedding of a context graph and applies the similar structure extracted from the relation similarity to infer new, unobserved triples from existing triples. Excerpts from large input graphs are regarded as the simplified and meaningful context of a group of entities in a given domain. Next, based on the context graph, CBOW [75] and Skip-Gram [245] models are used to model the KG embedding and perform KGC. In this method, the semantic regularities between words are preserved and adapted to entities and relations. Satisfactory results have been obtained in some specific fields.

The well-known BERT [243] is a prominent PLM obtained by pre-training the bidirectional Transformer encoder [246] through masked language modeling and next sentence prediction. It can capture rich linguistic knowledge in the pre-trained model weights. On this basis, a number of KGC models try to exploit BERT or its variants for learning knowledge embeddings and predicting facts:

KG-BERT [236] treats the entity and relation descriptions of a triple as a textual sequence input to the BERT framework, and naturally regards KGC problems as the corresponding sequence classification problems. KG-BERT computes the scoring function of the serialized triple with a simple classification layer. During the BERT fine-tuning procedure, it can obtain high-quality triple representations, which contain rich semantic information.

KEPLER [237] encodes textual entity descriptions with RoBERTa [238] as their embeddings, and then jointly optimizes the KG embedding and language modeling objectives. As a PLM, KEPLER can not only integrate factual knowledge into language representation with supervision from the KG, but can also produce effective text-enhanced KG embeddings without additional inference overhead compared to other conventional PLMs.

BLP [239] proposes a holistic evaluation framework for entity representations learned via inductive LP. Considering entities not seen during training, BLP learns inductive entity representations based on BERT and performs LP in combination with four different relational models: TransE, DistMult, ComplEx, and SimplE. BLP also provides evidence that the learned entity representations transfer well to other tasks (such as entity classification and information retrieval) without fine-tuning, which demonstrates that the entity embeddings act as compressed representations of the most salient features of an entity. This is additionally important because having generalized vector representations of KGs is useful for using them within other tasks.

Structure-augmented text representation (StAR) [240] augments the textual encoding paradigm with KGE techniques to learn KG embeddings for KGC. Following translation-based KGE methods, StAR partitions each triple into two asymmetric parts. These parts are then encoded into contextualized representations by a Siamese-style textual encoder. To avoid the combinatorial explosion of textual encoding approaches, e.g., KG-BERT, StAR employs a scoring module involving both a deterministic classifier and a spatial measurement for representation and structure learning respectively, which also enhances structured knowledge by exploring the spatial characteristics. Moreover, StAR presents a self-adaptive ensemble scheme to further boost the performance by incorporating triple scores from existing KGE models.
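The KG-BERT recipe described earlier in this subsection — serialize a triple's textual descriptions into one sequence and score its plausibility with a sequence classification head — can be sketched as follows. This is a simplified illustration using the Hugging Face transformers library, not the official KG-BERT code; the exact input format and training procedure in [236] differ in detail, and the example descriptions are our own.

# Schematic triple-as-text scoring in the spirit of KG-BERT.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # label 1 = plausible triple
model.eval()

def triple_plausibility(head_text, relation_text, tail_text):
    """Score a (head, relation, tail) triple given textual descriptions."""
    sequence = f"{head_text} [SEP] {relation_text} [SEP] {tail_text}"
    inputs = tokenizer(sequence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Before fine-tuning on (positive, corrupted) triples this score is meaningless;
# after fine-tuning it serves as the triple's plausibility for ranking.
print(triple_plausibility("Paris, capital city of France",
                          "capital of",
                          "France, country in Western Europe"))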
Table 27
Statistics of a part of TKGC technologies.
Model Loss functiona Whether consider time periods Datasets
Temporal order dependence models:
TransE-TAE [252] Lmarg no YAGO2
Diachronic embedding models:
DE-Simple [253] Sampled Lmll No ICEWS14, ICEWS15-05, GDELT15/16
ATiSE [254] Self-adversarial Lns Yes ICEWS14, ICEWS05-15, Wikidata12k, YAGO11k
Temporal Information embedding models
TTransE [255] Lmarg No Wikidata
HyTE [256] Sampled Lmll Yes Wikidata12k, YAGO11k
ConT [257] LBRL No ICEWS14,GDELT
TA-DistMult [258] Sampled Lmll No YAGO-15k, ICEWS14, ICEWS05-15, Wikidata
TNT-ComplEx [259] Instantaneous Lmll Yes ICEWS14, ICEWS15-05, YAGO-15k, Wikidata40k
Dynamic evolution models:
Know-Evolve [260] Conditional intensity function No GDELT, ICEWS14
RE-NET [261] Total classification LCE No ICEWS18, GDELT18
GHN [262] Total classification LCE No ICEWS18, GDELT15/16
TeMP [263] Sampled Lmll No ICEWS14, ICEWS05-15, GDELT
a. As usual, LCE, Lmarg and Lns refer to cross entropy loss, margin-based ranking loss and negative sampling loss, respectively. Besides, Lmll means the multiclass log-loss, and LBRL refers to the binary regularized logistic loss.
semantically-rich word embeddings; thus, when we say a KGC work uses a PLM, we consider it to obtain assistance from additional language information (from the other large language corpora on which the PLM has been fully trained). In other words, we should not judge whether a KGC model uses a third-party data source merely according to its datasets; most of the time it is more important to focus on the details of the model itself.

2. As we have discussed in Section 4.2.2.1, PLMs play an important role in capturing rich semantic information, which is helpful to KGC. Along with the growing number of assorted PLMs, in particular models that jointly learn language representations from both KGs and large language corpora, some PLMs introduce the structured data of KGs into the pre-training process through specific KGC tasks to obtain more reasonable language model parameters (such as ERNIE [247], CoLAKE [248–251]). In the future, an efficient joint learning framework that derives entity representations from both KGs and language corpora may be needed, and the key point is how to design the interaction between these two data sources; an iterative learning manner, just as the Rule-KG embedding series works, may be a possible future direction. What is needed is a method to derive entity representations that work well for both common and rare entities.

5. Other KGC technologies

In this part we focus on several other KGC techniques oriented to special domains, including Temporal Knowledge Graph Completion (TKGC) in Section 5.1, which concerns time elements in KGs; CommonSense Knowledge Graph Completion (CSKGC), a relatively new field studying commonsense KGs (see Section 5.2); and Hyper-relational Knowledge Graph Completion (HKGC), which pays attention to n-ary relational facts instead of the usual binary triples in KGs (see Section 5.3).

5.1. Temporal Knowledge Graph Completion (TKGC)

At present, many facts in KGs are affected by temporal information, since facts in the real world are not always static but are often ephemeral; for example, (Obama, PresidentOf, USA) is true only during a certain time segment. Intuitively, temporal aspects of facts should play an important role when we perform KGC [252]. In this section, we briefly introduce some well-known TKGC models. Naturally, a summary table is provided to sum up all the TKGC methods introduced in our overview (Table 27).

Temporal Knowledge Graphs (TKGs) and TKGC: KGs with such temporal information are generally called TKGs. Naturally, the completion of such KGs is called TKGC, and the original triples are redefined as quadruples (h, r, t, T), where T is the time (which can be a timestamp or a time span [Tstart, Tend]) [252]. Studying time-aware KGC problems helps to achieve more accurate completion results; i.e., in the LP task, we can distinguish which triple is real under a given time condition, such as (Barack Obama, PresidentOf, USA, 2010) versus (Bill Clinton, PresidentOf, USA, 2010). In addition, some literature also proposes a time prediction task, which predicts the most likely time for a given entity and relation by learning time embeddings vT.

According to the usage manner of temporal information, we roughly categorize recent TKGC methods into four groups: temporal order dependence models, diachronic embedding models, temporal information embedding models, and dynamic evolution models.

5.1.1. Temporal order dependence models

The mentioned temporal order information indicates that, under the time condition, some relations may follow a certain order on the timeline, such as BornIn → WorkAt → DiedIn. TransE-TAE [252] first incorporates two kinds of temporal information for KG completion: (a) temporal order information and (b) temporal consistency information. To capture the temporal order of relations, it designs a temporal evolving matrix MT with which a prior relation can evolve into a subsequent relation (as Fig. 28 shows). Specifically, given two facts with the same head entity, (ei, r1, ej, T1) and (ei, r2, ek, T2) with T1 earlier than T2, it assumes that the prior relation r1 projected by MT should be near the subsequent relation r2, i.e., r1 MT ≈ r2. In this way, TransE-TAE can separate prior relations and subsequent relations automatically during training. Note that the temporal order information is finally treated as a regularization term injected into the original loss function, being optimized together with the KG structural information.
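To make the temporal-order idea concrete, the sketch below shows one plausible way to attach such a regularizer to a TransE-style training step; the tensor names, margin value and weighting factor are illustrative assumptions rather than the exact formulation of [252].

```python
import torch

def transe_score(h, r, t):
    # TransE energy: lower means more plausible.
    return torch.norm(h + r - t, p=1, dim=-1)

def temporal_order_regularizer(r_prior, r_next, M_T):
    # Encourage the projected prior relation to lie near the subsequent
    # relation: r_prior @ M_T should approximate r_next.
    return torch.norm(r_prior @ M_T - r_next, p=1, dim=-1)

def training_loss(pos, neg, order_pairs, M_T, margin=1.0, lam=0.1):
    # pos/neg: tuples of (h, r, t) embeddings for true and corrupted triples.
    # order_pairs: (r_prior, r_next) embeddings of relations that share a head
    # entity and occur in temporal order (T1 before T2).
    h, r, t = pos
    h_n, r_n, t_n = neg
    margin_loss = torch.clamp(margin + transe_score(h, r, t)
                              - transe_score(h_n, r_n, t_n), min=0).sum()
    reg = temporal_order_regularizer(*order_pairs, M_T).sum()
    return margin_loss + lam * reg
```

In this reading, the structural margin loss and the temporal-order term share the same optimization step, which matches the description of the regularizer being injected into the original loss function.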
5.1.2. Diachronic embedding models

This kind of model often designs a mapping function from a time scalar to an entity or relation embedding, feeding both the time and the entity/relation into a specific diachronic function framework to
Table 28
Statistics of several Temporal Knowledge Graph datasets.
Dataset              Entity    Relation   #Train facts   #Valid facts   #Test facts   Timestamps
Time slot-based datasets:
Wikidata [258]       11,134    95         121,442        14,374         14,283        1726 (1 year)
Wikidata12K [256]    12,554    24         32.5k          4k             4k            232 (1 year)
YAGO11K [256]        10,623    10         16.4k          2k             2k            189 (1 year)
YAGO15K [258]        15,403    34         110,441        13,815         13,800        198 (1 year)
Fact-based datasets:
ICEWS14 [253]        7128      230        72,826         8941           8963          365 (1 day)
ICEWS18 [253]        23,033    256        373,018        45,995         49,545        304 (1 day)
ICEWS05-15 [253]     10,488    251        386,962        46,275         46,092        4017 (1 day)
GDELT(15-16) [253]   500       20         2,735,685      341,961        341,961       366 (1 day)
GDELT(18) [261]      7691      240        1,734,399      238,765        305,241       2751 (15 min)
Table 29
Evaluation results of TKGC on ICEWS14, ICEWS05-15 and GDELT datasets. Best results are in bold.
ICEWS14 ICEWS05-15 GDELT
MRR Hits@1 Hits@3 Hits@10 MRR Hits@1 Hits@3 Hits@10 MRR Hits@1 Hits@3 Hits@10
TransE [11] 0.280 0.094 – 0.637 0.294 0.090 – 0.663 0.113 0.0 0.158 0.312
DistMult [42] 0.439 0.323 – 0.672 0.456 0.337 – 0.691 0.196 0.117 0.208 0.348
SimplE [44] 0.458 0.341 0.516 0.687 0.478 0.359 0.539 0.708 0.206 0.124 0.220 0.366
ComplEx [43] 0.456 0.343 0.516 0.680 0.483 0.366 0.543 0.710 0.226 0.142 0.242 0.390
TTransE [255] 0.255 0.074 – 0.601 0.271 0.084 – 0.616 0.115 0.0 0.160 0.318
HyTE [256] 0.297 0.108 0.416 0.655 0.316 0.116 0.445 0.681 0.118 0.0 0.165 0.326
TA-DistMult [258] 0.477 0.363 – 0.686 0.474 0.346 – 0.728 0.206 0.124 0.219 0.365
ConT [257] 0.185 0.117 0.205 0.315 0.163 0.105 0.189 0.272 0.144 0.080 0.156 0.265
DE-TransE [253] 0.326 0.124 0.467 0.686 0.314 0.108 0.453 0.685 0.126 0.0 0.181 0.350
DE-DistMult [253] 0.501 0.392 0.569 0.708 0.484 0.366 0.546 0.718 0.213 0.130 0.228 0.376
DE-SimplE [253] 0.526 0.418 0.592 0.725 0.513 0.392 0.578 0.748 0.230 0.141 0.248 0.403
ATiSE [254] 0.545 0.423 0.632 0.757 0.533 0.394 0.623 0.803 – – – –
TeMP-GRU [263] 0.601 0.478 0.681 0.828 0.691 0.566 0.782 0.917 0.275 0.191 0.297 0.437
TeMP-SA [263] 0.607 0.484 0.684 0.840 0.680 0.553 0.769 0.913 0.232 0.152 0.245 0.377
TNTComplEx [259] 0.620 0.520 0.660 0.760 0.670 0.590 0.710 0.810 – – – –
YAGO dump, and kept all facts involving those entities. Then, this collection of facts is augmented with time information from the yagoDateFacts dump. Contrary to the ICEWS datasets, YAGO15K does contain temporal modifiers, namely 'occursSince' and 'occursUntil' [258]. What is more, all facts in YAGO15K keep time information at the same level of granularity as in the original dumps these datasets come from, which is different from [255].

YAGO11k [256] is a rich subgraph of YAGO3 [267], including the top 10 most frequent temporally rich relations of YAGO3. By recursively removing edges containing entities with only a single mention in the subgraph, YAGO11k handles sparsity effectively and ensures healthy connectivity within the graph.

Wikidata: Similar to YAGO11k, Wikidata contains time interval information. As a subset of Wikidata, Wikidata12k is extracted from a preprocessed dataset of Wikidata proposed by [255]. Its creation procedure follows the process described for YAGO11k: by distilling out the subgraph with time mentions for both start and end, it ensures that no entity has only a single edge connected to it [256], but it is almost double the size of YAGO11k.

Performance Results Comparison: We report some published experimental results of TKGC methods in Table 29, from which we find that TeMP-SA and TeMP-GRU achieve satisfying results on all three datasets across all evaluated metrics. Compared to the most recent work TNTComplEx [259], which achieved the best performance on the ICEWS datasets before TeMP, they are 8.0% and 10.7% higher on the Hits@10 evaluation. Additionally, TeMP also achieves a 3.7% improvement on GDELT compared with DE, the prior state of the art on that dataset, while the results of the ATiSE and TNTComplEx methods on the GDELT dataset are not available.

5.1.6. Analysis of TKGC models

Inspired by the excellent performance of translation models and tensor factorization models in traditional KGC, temporal knowledge graph completion (TKGC) mainly introduces temporal embeddings into the entity or relation embeddings based on the above two kinds of KGC ideas. Recently, with the wide application of GCNs to heterogeneous graphs, more and more TKGC methods adopt the idea of a "subgraph of a TKG" [261], which we call a temporal subgraph: they aggregate the neighborhood information at each time step and finally collaborate with a sequence model (an RNN) to complete the time migration between subgraphs. Future methods may continue to explore the construction of temporal subgraphs and pay attention to the relevance between temporal subgraphs. In addition, more attention may be paid to the static information that exists in TKGs, so as to promote the integration of TKGC and traditional KGC methods.
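To illustrate the temporal-subgraph idea described above, the following minimal sketch (our own illustrative PyTorch rendering, not the exact architecture of RE-NET [261] or TeMP [263]) aggregates each timestamp's neighborhood with a simple graph encoder and then lets a GRU carry entity states across timestamps; the layer shapes and the mean aggregation are assumptions.

```python
import torch
import torch.nn as nn

class TemporalSubgraphEncoder(nn.Module):
    """Encode a TKG as a sequence of per-timestamp subgraphs."""

    def __init__(self, num_entities, dim):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)
        self.aggregate = nn.Linear(2 * dim, dim)   # stand-in for a relation-aware GCN layer
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def encode_snapshot(self, x, edges):
        # Mean-aggregate neighbor states along the edges of one snapshot.
        src, dst = edges                                      # LongTensors of endpoints
        msg = torch.zeros_like(x).index_add_(0, dst, x[src])
        deg = torch.zeros(x.size(0), 1).index_add_(0, dst, torch.ones(dst.size(0), 1)).clamp(min=1)
        return torch.relu(self.aggregate(torch.cat([x, msg / deg], dim=-1)))

    def forward(self, snapshots):
        # snapshots: list of edge-index tensors, one per timestamp.
        x = self.entity_emb.weight
        states = [self.encode_snapshot(x, edges) for edges in snapshots]
        seq = torch.stack(states, dim=1)          # (num_entities, T, dim)
        out, _ = self.gru(seq)                    # temporal migration across subgraphs
        return out[:, -1]                         # entity states at the latest time step
```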
5.2. CommonSense Knowledge Graph Completion (CSKGC)

CommonSense knowledge, also referred to as background knowledge [268], is a potentially important asset towards building versatile real-world AI applications, such as visual understanding for describing images (e.g., [269–271]) and recommendation systems or question answering (e.g., [272–274]). Thereby a novel kind of KG involving CommonSense knowledge has emerged, CommonSense knowledge graphs (CSKGs). We are naturally interested in the completion of CSKGs, and here give a presentation of a series of CommonSense Knowledge Graph Completion (CSKGC) techniques. The corresponding summary table of the described CSKGC methods is shown in Table 30.

CommonSense knowledge graphs (CSKGs) almost always provide a confidence score along with every relation fact, for representing
Table 30
Statistics of recent popular CommonSense KGC technologies.
Model Technology Information Datasets
Language Auxiliary CSKGC Models with Pre-trained Language Models:
NAM [64] Neural Association Model, Large unstructured texts CN14
neural networks:
DNN and relation-modulated neural nets (RMNN),
probabilistic reasoning,
PLMs: skip-gram
DNN-Bilinear DNN, Text phrases ConceptNet 100K
[275] Bilinear architecture,
averaging the word embeddings (DNN AVG, Bilinear AVG),
max pooling of LSTM (DNN LSTM, Bilinear LSTM),
PLMs: skip-gram
CSKGC-G [268] DNN AVG in [275], Text phrases ConceptNet
attention pooling of DNN LSTM, 100K,
bilinear function, JaKB
defining CSKG generation task
COMET [276] Automatic CSKG generation, CSKG structure and relations ConceptNet,
adaptable framework, ATOMIC
GPT,
multiple transformer blocks of multi-headed attention
MCC [277] End-to-end framework, Graph structure of local ConceptNet,
encoder: GCNs + fine-tuned BERT, neighborhood, ATOMIC
decoder: ConvTransE, semantic context of nodes in KGs
A progressive masking strategy
CSKGC with Logical Rules:
UKGEs [278] Uncertain KGE, Structural and uncertainty ConceptNet,
probabilistic soft logic information CN15k,
of relation facts NL27k,
PPI5k
DICE [279] ILP (Integer linear programming), CommonSense knowledge ConceptNet,
weighted soft constraints, statements (four dimensions), Tuple-KB,
the theory of reduction costs of a relaxed LP, taxonomic hierarchy related concepts Qasimodo
joint reasoning over CommonSense,
knowledge statements sets
v_x = \sum_{j=1}^{J} \frac{\exp(e_j)}{\sum_{k=1}^{J}\exp(e_k)}\,\mathrm{hidden}_x^{j}, \qquad (x = h, t)

e_k = w^{\top}\tanh\bigl(W\,\mathrm{hidden}_x^{k}\bigr), \qquad (x = h, t)

v_{ht} = \mathrm{Bilinear}(v_h, v_t)

v_{in} = \mathrm{concat}(v_{ht}, v_r)

Except for the commonly used variables, J means the word length of phrase h (or t), and w is a linear transformation vector for calculating the attention vector. Besides, v_x^j and hidden_x^j are the jth word embedding and the jth hidden state of the LSTM for phrase x (x = h, t). Another highlight in [268] is that it develops a commonsense knowledge generation model which shares information with the CSKGC part; its framework is shown in Fig. 29. This devised model jointly learns the completion and generation tasks, which improves the completion task because triples generated by the generation model can be used as additional training data for the completion model. In this way, this work can increase the node size of CSKGs and increase the connectivity of CSKGs.
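A compact way to read these equations is as attention pooling over LSTM states of each phrase, followed by a bilinear interaction between the two pooled phrase vectors. The sketch below is our own illustrative rendering under these assumptions (the dimension names and the final scoring layer are not the exact model of [268] or [275]).

```python
import torch
import torch.nn as nn

class PhraseAttentionScorer(nn.Module):
    """Attention-pool LSTM states of two phrases, then score them with a bilinear term."""

    def __init__(self, emb_dim, hidden_dim, rel_dim):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)   # inside tanh(W hidden)
        self.w = nn.Linear(hidden_dim, 1, bias=False)            # attention vector w
        self.bilinear = nn.Bilinear(hidden_dim, hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim + rel_dim, 1)            # score from concat(v_ht, v_r)

    def pool(self, word_embs):
        # word_embs: (batch, J, emb_dim) word embeddings of one phrase.
        hidden, _ = self.lstm(word_embs)                 # (batch, J, hidden_dim)
        e = self.w(torch.tanh(self.W(hidden)))           # (batch, J, 1) attention logits
        alpha = torch.softmax(e, dim=1)                  # exp(e_j) / sum_k exp(e_k)
        return (alpha * hidden).sum(dim=1)               # pooled phrase vector v_x

    def forward(self, head_words, tail_words, v_r):
        v_h, v_t = self.pool(head_words), self.pool(tail_words)
        v_ht = self.bilinear(v_h, v_t)                   # Bilinear(v_h, v_t)
        v_in = torch.cat([v_ht, v_r], dim=-1)            # concat(v_ht, v_r)
        return self.out(v_in).squeeze(-1)                # plausibility score of the tuple
```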
5.2.3. CSKGC with logical rules

Uncertain KGEs (UKGEs) [278] explores uncertain KGE approaches, including CSKGC research. Preserving both the structural and the uncertainty information of triples in the embedding space, UKGEs learns embeddings according to the confidence scores of uncertain relation facts, and further applies probabilistic soft logic to infer confidence scores for unseen relation facts during training.

Diverse CommonSense Knowledge (DICE) [279] is a multi-faceted method with weighted soft constraints to couple the inference over concepts (that are related in a taxonomic hierarchy) for deriving refined and expressive CommonSense knowledge. To capture the refined semantics of noisy CommonSense knowledge statements, it considers four dimensions of concept properties, plausibility, typicality, remarkability and saliency, and models the coupling of these dimensions with a soft constraint system, which expresses inter-dependencies between the four CommonSense knowledge dimensions with three kinds of logical constraints: Concept-dimension dependencies, Parent–child dependencies and Sibling dependencies, enabling effective and scalable joint reasoning over noisy candidate statements. Note that the
Table 32
Statistics of CommonSense Knowledge Graph datasets.
Dataset    Entity    Relation   Facts (#Train / #Val1 / #Val2 / #Test)
ATOMIC     256,570   9          610,536 / – / – / –
CN14       159,135   14         200,198 / 5000 / – / 10,000
JaKB       18,119    7          192,714 / 13,778 / – / 13,778
CN-100K    78,088    34         100,000 / 1200 / 1200 / 2400
CN15k      15,000    36         241,158
NL27k      27,221    404        175,412
PPI5k      4999      7          271,666

Table 33
Summary about CSKGC models.
Model                  Completion   Generation
NAMs [64]              ✓            –
DNN-Bilinear [275]     ✓            ✓
CSKGC-G [268]          ✓            ✓
COMET [276]            ✓            ✓
MCC [277]              ✓            –
UKG embedding [278]    ✓            –
Dice [279]             ✓            –
Table 34
CommonSense KGC (CSKGC) evaluation on CN-100K and ATOMIC with subgraph sampling [277]. The baselines are presented at the top of the table, the middle part shows the KGC results of COMET, and the bottom half shows the model implementations in [277].
Model CN-100K ATOMIC
MRR Hits@1 Hits@3 Hits@10 MRR Hits@1 Hits@3 Hits@10
DISTMULT 8.97 4.51 9.76 17.44 12.39 9.24 15.18 18.3
COMPLEX 11.4 7.42 12.45 19.01 14.24 13.27 14.13 15.96
CONVE 20.88 13.97 22.91 34.02 10.07 8.24 10.29 13.37
CONVTRANSE 18.68 7.87 23.87 38.95 12.94 12.92 12.95 12.98
COMET-NORMALIZED 6.07 0.08 2.92 21.17 3.36 0 2.15 15.75
COMET-TOTAL 6.21 0 0 24 4.91 0 2.4 21.6
BERT + CONVTRANSE 49.56 38.12 55.5 71.54 12.33 10.21 12.78 16.2
GCN + CONVTRANSE 29.8 21.25 33.04 47.5 13.12 10.7 13.74 17.68
SIM + GCN + CONVTRANSE 30.03 21.33 33.46 46.75 13.88 11.5 14.44 18.38
GCN + BERT + CONVTRANSE 50.38 38.79 56.46 72.96 10.8 9.04 11.21 14.1
SIM + GCN + BERT + CONVTRANSE 51.11 39.42 59.58 73.59 10.33 8.41 10.79 13.86
Table 35
Statistics of recent popular hyper-relational KGC technologies.
Model Hyper-relational fact representation Information Technology Task
(for n-ary fact(h, r , t) with (ki , vi ))
m-TransH [291] {(rh , rt , k1 , . . . , kn ), n-ary key–value pairs A direct modeling framework Predict entities
(h, t , v1 , . . . , vn )} for embedding multifold relations,
fact representation recovering,
TransH
RAE [112] {(rh , rt , k1 , . . . , kn ), n-ary key–value pairs m-TransH, Predict entities
(h, t , v1 , . . . , vn )} relatedness between entities,
instance reconstruction
NaLP [292] {rh : h, rt : t , ki : vi }, n-ary key–value pairs CNN, Predict entities
i = 1, . . . , n key–value pairs relatedness predict relations
HINGE [8] (h, r , t), Triple data, CNN, Predict entities
{rh : h, rt : t , ki : vi }, i = 1, . . . , n n-ary key–value pairs triple relatedness, key–value pairs relatedness predict relations
Table 36
Statistics of popular hyper-relational datasets.
Dataset       Entity    Relation   #Train (Binary / N-ary / Overall)   #Valid (Binary / N-ary / Overall)   #Test (Binary / N-ary / Overall)
JF17K         28,645    322        44,210 / 32,169 / 76,379            – / – / –                           10,417 / 14,151 / 24,568
WikiPeople1   47,765    707        270,179 / 35,546 / 305,725          33,845 / 4378 / 38,223              33,890 / 4391 / 38,281
WikiPeople2   34,839    375        280,520 / 7389 / 287,918            – / – / –                           36,597 / 971 / 37,586
but also extracting further useful information from the corresponding key–value pairs simultaneously. HINGE also applies a neural network framework equipped with convolutional structures, just like the network of [292].

5.3.3. Negative sampling about hyper-relational data

A commonly adopted negative sampling process on HKGs is to randomly corrupt one key or value in a true fact. For example, for an n-ary relational fact representation {rh : h, rt : t, ki : vi}, i = 1, ..., n, when corrupting the key rh by a randomly sampled rh' (with r ≠ r'), the negative fact becomes {rh' : h, rt : t, ki : vi}, i = 1, ..., n. However, this negative sampling process is not fully adaptable to the n-ary representation of hyper-relational facts; it is unrealistic especially for the keys rh and rt, as rh' is not compatible with rt, while only one relation r (or r') can be assumed between h and t in a hyper-relational fact [8]. Therefore, an improved negative sampling method is proposed to fix this issue in [8]. Specifically, when corrupting the key rh by a randomly sampled rh' (with r ≠ r'), the novel negative sampling approach also corrupts rt by rt', resulting in a negative fact {rh' : h, rt' : t, ki : vi}, i = 1, ..., n. Subsequently, for this negative fact, only a single relation r' links h and t. Similarly, when corrupting rt, we also corrupt rh in the same way. This new process is more realistic than the original one, as illustrated by the sketch below.
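The following minimal sketch contrasts the naive corruption with the fixed process of [8]; the dictionary-based encoding of a hyper-relational fact and the helper names are our own illustrative choices, not part of the original implementation.

```python
import random

def corrupt_naive(fact, relation, relations):
    """Corrupt only the head key rh: the result pairs rh' with the old rt,
    which is unrealistic because only one relation can link h and t."""
    r_new = random.choice([r for r in relations if r != relation])
    neg = dict(fact)
    neg[f"{r_new}.head"] = neg.pop(f"{relation}.head")
    return neg

def corrupt_fixed(fact, relation, relations):
    """Fixed process of [8]: corrupt rh and rt together, so that a single
    relation r' still links h and t in the negative fact."""
    r_new = random.choice([r for r in relations if r != relation])
    neg = dict(fact)
    neg[f"{r_new}.head"] = neg.pop(f"{relation}.head")
    neg[f"{r_new}.tail"] = neg.pop(f"{relation}.tail")
    return neg

# A hyper-relational fact {rh: h, rt: t, ki: vi} encoded as a dictionary,
# with the head/tail keys named after the relation (illustrative encoding).
fact = {"educated_at.head": "Albert_Einstein",
        "educated_at.tail": "ETH_Zurich",
        "academic_degree": "Bachelor"}
relations = ["educated_at", "employer", "award_received"]
print(corrupt_naive(fact, "educated_at", relations))
print(corrupt_fixed(fact, "educated_at", relations))
```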
5.3.4. Performance analysis of HKGC models

Datasets: As we have discussed, hyper-relational data is one natural fact style in KGs. For uniform modeling and learning, a KG is usually represented as a set of binary relational triples by decomposing n-ary relational facts into multiple triples relying on added virtual entities; for example, in Freebase, a particular so-called "star-to-clique" (S2C) conversion procedure transforms non-binary relational data into binary triples on filtered Freebase data [291]. Since such procedures have been verified to be irreversible [291], they cause a loss of structural information in the multi-fold relations; in other words, this kind of transformed traditional triple dataset is no longer suitable for n-ary relational fact learning. Therefore, special datasets for HKG embedding and completion are built as follows:

JF17K [291] is extracted from Freebase. After removing the entities involved in very few triples and the triples involving String, Enumeration Type, and Numbers, JF17K recovers a fact representation from the remaining triples. During fact recovering, it firstly removes facts from meta-relations which have only one single role. Then JF17K randomly selects 10,000 facts from each meta-relation containing more than 10,000 facts. According to two instance representation strategies, JF17K further constructs two instance representations Tid(F) and T(F), where F means the resulting fact representation from the previous steps. Next, the final dataset is built by further applying filtering on Tid(F) and T(F) into G, Gid, and randomly splitting along with the original instance representation operation s2c(G). These resulting datasets are uniformly called JF17K; we give their statistics in Table 36.

WikiPeople [292] is extracted from Wikidata and focuses on entities of type human, without any specific filtering, to improve the presence of hyper-relational facts. The original WikiPeople dataset version in [292] also contains literals (used as tails) in some facts; Rosso et al. [8] further filter out these non-entity literals and the corresponding facts. Table 36 gives the main statistics of these two versions of the WikiPeople dataset. Each of these datasets contains both triple facts and hyper-relational facts.

Performance Comparison of HKGC Models: To get an understanding of the HKGC performance of existing models, we refer to the newest public KGC results for learning from hyper-relational facts in [8] (shown in Table 37). We observe that HINGE [8] consistently outperforms all other models when learning hyper-relational facts; it even performs better than the best-performing baseline NaLP-Fix [292], showing a 13.2% improvement on the link prediction (LP) task and a 15.1% improvement on the relation prediction (RP) task on WikiPeople (84.1% and 23.8% on JF17K, respectively). Also, from Table 37 we can see that NaLP shows better performance than m-TransH and RAE, since it learns the relatedness between relation–entity pairs while m-TransH and RAE learn from entities only.

Moreover, Rosso et al. [8] noted that m-TransH and RAE result in very low performance on WikiPeople, which is probably due to the weak presence of hyper-relational facts in WikiPeople, while m-TransH and RAE are precisely designed for hyper-relational facts. Besides, it is obvious that NaLP-Fix (with a fixed negative sampling process) consistently shows better performance compared to NaLP, with a slight improvement of 2.8% in head/tail prediction and a tremendous improvement of 69.9% in RP on WikiPeople (10.4% and 15.8% on JF17K, respectively). This result verifies the effectiveness of the fixed negative sampling process proposed in [8], in particular for RP.

In addition, the baseline methods learning from hyper-relational facts (i.e., m-TransH, RAE, NaLP and NaLP-Fix) surprisingly yield worse performance in many cases than the best-performing baseline which learns from triples only [8]. They further explain that ignoring the triple structure results in this subpar performance, because the triple structure in KGs preserves essential information for KGC.

6. Discussion and outlook

6.1. Discussion about KGC studies

According to a series of systematic studies of recent KGC works, we discuss several major points as follows:

1. About Traditional KGC Models: As the KGC technology matures, the traditional translation models, decomposition models and neural network models in this field tend to become more and more commonly used as baseline KGC tools, to be integrated with other technologies for promising, efficient and effective KGC research.

2. About the Optimization Problem: It is absolutely necessary to pay attention to the optimization method. A proper optimization method can make it faster or more accurate to obtain a solution. The modeling of the optimization objective also determines whether the KGC problem has a global or only a local optimal solution, or in some cases it can improve the situation of easily falling into a local optimal (suboptimal) solution, which is not conducive to the KGC task.
Table 37
The performance of several HKGC methods on WikiPeople and JF17K [8].
Method WikiPeople JF17K
Head/Tail prediction Relation prediction Head/Tail prediction Relation prediction
MRR Hit@10 Hit@1 MRR Hit@10 Hit@1 MRR Hit@10 Hit@1 MRR Hit@10 Hit@1
m-TransH 0.0633 0.3006 0.0633 N/A 0.206 0.4627 0.206 N/A
RAE 0.0586 0.3064 0.0586 N/A 0.2153 0.4668 0.2153 N/A
NaLP 0.4084 0.5461 0.3311 0.4818 0.8516 0.3198 0.2209 0.331 0.165 0.6391 0.8215 0.5472
NaLP-Fix 0.4202 0.5564 0.3429 0.82 0.9757 0.7197 0.2446 0.3585 0.1852 0.7469 0.8921 0.6665
HINGE 0.4763 0.5846 0.4154 0.95 0.9977 0.9159 0.4489 0.6236 0.3611 0.9367 0.9894 0.9014
3. About Regularization and Constraints: During the learning of a specific model, proper regularization and constraints, as well as the skill of hyper-parameter tuning, can make the trained model achieve unexpected results. Although this is an empirical step, maybe even with potential threats (for example, N3 regularization [50] requires larger embedding dimensions, and some optimization techniques (e.g., Tucker [55]) may require a large number of parameters, so the resulting scalability or economy issues need to be considered), we should attach importance to model tuning work. Relevant attention has been raised in previous works [50], formally raising the question of whether poorly adjusted parameters or the model itself should be held responsible for a bad performance; this needs to be studied and verified experimentally and continuously, and it emphasizes that model tune-up work is as important as the optimization model itself.

4. About Joint Learning Related to KGC: We conclude that the joint KGC models that jointly learn distinct components tend to develop their energy function in a composition form. The joint KGC methods usually extend the original definition of triple energy (distance energy, similarity energy, etc.) to consider the new multimodal representations.

5. About Information Fusion Strategies: We also conclude several common experiences here. One of them is that, when it comes to the internal combination of the same kind of information (such as collecting useful surrounding graph context as effectively as possible for learning a proper neighbor-aware representation, or combining different paths between an entity pair), an attention mechanism along with various neural network structures is an appropriate fusion strategy in most cases. Moreover, drawing lessons from the NLP field, the RNN structure is suitable for dealing with sequence problems. For example, when considering path modeling, the generally applied neural network structure is the RNN [96,166–168], and [163], as well as in the situation of utilizing textual information (especially long text sequences) for KGC.

6. Embedding-based Reasoning and Rule-based Reasoning: As we have introduced and analyzed in our work, both rule-based reasoning and embedding-based reasoning have their separate advantages and disadvantages. In this case, researchers tend to make these two kinds of KGC models cooperate, expecting to exert both of their superiorities sufficiently.

6.2. Outlook on KGC

We give the following outlooks depending on our observation and overview in this paper:

1. A Deep-level Interaction is Beneficial for KGC. In the aspect of adding additional information for KGC, especially the extra information outside KGs, such as the rules and external text resources we mentioned, an emerging research trend is exploring deep-level interactive learning between external knowledge and internal knowledge. That is, designing a model jointly with a combination of parameter sharing and information circulation, even employing an iterative learning manner, to achieve the goal of enriching the knowledge of the internal KG with external information, which in turn feeds the training information back to the encoding-side module based on both the external information and the internal KG's data, while continuously replenishing the "knowledge" of the KG.

2. Rule-based KGC is Promising. As introduced in our paper, rule-based approaches perform very well and are a competitive alternative to popular embedding models. For that reason, they are promising candidates to be included as baselines for the evaluation of KGC methods, and it has been recommended that conducting the evaluation on a more fine-grained level is necessary and instructive for further study of the KGC field in the future.

3. Trying New PLMs is Feasible. Obviously, the endless stream of new pre-trained language models (PLMs) provides unlimited possibilities to combine effective language models with various text information for obtaining high-quality embeddings and capturing abundant semantic information to complete KGs.

4. There is Plenty of Scope for Specific-Field KGC. The emergence of new KGs in various specific fields stimulates completion research on specific-field KGs. Although the existing KGC works concerning the KGs of specific fields and demands are still relatively rare (for example, there are only a few or a dozen pieces of literature studying the completion of CommonSense KGs and Hyper-Relational KGs), KGC for specific-field KGs is definitely meaningful, with great practical application value, and will be further developed in the future.

5. Capturing the Interaction between Distinct KGs will be Helpful to KGC. A series of tasks have emerged with the need for interaction between various KGs, such as entity alignment, entity disambiguation, attribute alignment and so on. When it comes to multi-source knowledge fusion, the research on heterogeneous graph embedding (HGE) and multilingual Knowledge Graph Embedding (MKGE) has gradually attracted much attention, which is not covered in our current review. KGC under multi-KG interaction could evolve as a sub-direction for the future development of KGC, which may create some inspiring ideas by studying the unified embedding and completion of different types and structures of knowledge. By the way, the KGC work with respect to multilingual KGs is insufficient; it is worth launching this research direction to replenish the multilingual KGs demanded in real applications.

6. Select a More Proper Modeling Space. A novel opinion indicates that the modeling space of KG embedding does not have to be limited to Euclidean space, as most of the literature does (TransE and its extensions); on the contrary, as KGs possess an intrinsic characteristic of presenting power-law (or scale-free) degree distributions like many other networks [293,294], it has been shown that scale-free networks naturally emerge in hyperbolic space [295]. Recently, hyperbolic geometry was exploited in various works [296–298] as a means to provide high-quality embeddings for hierarchical structures instead of ordinary Euclidean space. The work in [295] illustrated that hyperbolic space has the potential to play a significant role in the task of KGC, since it offers a natural way to take the KG's topological information into account. This situation inspires researchers to explore more
effective and reasonable embedding vector spaces for KGC in which to implement the basic translation transformation or tensor decomposition of entities and relations; the expected model space should be able to easily model complex types of entities and relations, along with various structural information.
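As a concrete point of reference, hyperbolic KGC models commonly measure the closeness of two embeddings u and v inside the Poincaré ball with the hyperbolic distance below; this is the generic formula rather than the specific construction of [295–298]:

d_{\mathbb{B}}(\mathbf{u},\mathbf{v}) \;=\; \operatorname{arcosh}\!\left(1 + \frac{2\,\lVert\mathbf{u}-\mathbf{v}\rVert^{2}}{\bigl(1-\lVert\mathbf{u}\rVert^{2}\bigr)\bigl(1-\lVert\mathbf{v}\rVert^{2}\bigr)}\right)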
7. Explore the Usage of RL in KGC. Reinforcement learning (RL) has seen a variety of applications in NLP, including machine translation [299], summarization [300], and semantic parsing [301]. Compared to other applications, RL formulations in NLP and KGs tend to have a large action space (e.g., in machine translation and KGC, the space of possible actions is the entire vocabulary of a language and the whole set of neighbors of an entity, respectively) [302]. On this basis, more recent work formulates multi-hop reasoning as a sequential decision problem and exploits reinforcement learning (RL) to perform effective path search [63,141,303,304]. Under normal circumstances, an RL agent is designed to find reasoning paths in the KG, which can control the properties of the found paths rather than using random walks as previous path-finding models did. These effective paths not only can be used as an alternative to the Path Ranking Algorithm (PRA) in many path-based reasoning methods, but can also be treated as reasoning formulas [303]. In particular, some recent studies apply human-defined reward functions; a foreseeable future direction is to investigate the possibility of incorporating other strategies (such as adversarial learning [27]) to give better rewards than human-defined reward functions. On the other hand, a discriminative model can be trained to give rewards instead of designing rewards according to path characteristics. Additionally, in the future, the RL framework can be developed to jointly reason over KG triples and text mentions, which can help to address the problematic scenario in which the KG does not have enough reasoning paths.

8. Multi-task Learning about KGC. Multi-task learning (MTL) [305] is attracting growing attention, the idea being that the combined learning of multiple related tasks can outperform learning each task in isolation. With the idea of MTL, KGC can be learned and trained together with other KG-based tasks (or properly designed ancillary tasks) under the MTL framework, which could gain both representability and generalization by sharing the common information between the tasks during the learning process, so as to achieve better overall performance.

7. Conclusion

With this overview, we tried to fill a research gap regarding a systematic and comprehensive introduction of Knowledge Graph Completion (KGC) works and to shed new light on the insights gained in previous years. We make up a novel full-view categorization, comparison, and analysis of research on KGC studies. Specifically, at the high level, we review KGC in three major aspects: KGC merely with internal structural information, KGC with additional information, and other special KGC studies. For the first category, KGC is reviewed under Tensor/matrix factorization models, Translation models, and Neural Network models. For the second category, we further propose fine-grained taxonomies from two views about the usage of inside information of KGs (including node attributes, entity-related information, relation-related information, neighbor information, and relational path information) or outside information of KGs (including rule-based KGC and third-party data sources-based KGC). The third part pays attention to other special KGC, such as CommonSense KGC, Temporal KGC, and Hyper-relational KGC. In particular, our survey provides a detailed and in-depth comparison and analysis of each KGC category at a fine-grained level, and finally gives a global discussion and prospect for the future research directions of KGC. This paper may help researchers grasp the main ideas and results of KGC, and highlight ongoing research on them. In the future, we will design a relatively uniform evaluation framework and conduct more detailed experimental evaluations.

CRediT authorship contribution statement

Tong Shen: Classification, Comparisons and analyses, Performance evaluation, Writing – original draft, Revision responses. Fu Zhang: Classification, Writing – review & editing, Revision responses. Jingwei Cheng: Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors sincerely thank the editors and the anonymous reviewers for their valuable comments and suggestions, which improved the paper. The work is supported by the National Natural Science Foundation of China (61672139) and the Fundamental Research Funds for the Central Universities, China (No. N2216008).

References

[1] Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard de Melo, Claudio Gutierrez, José Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier, Axel Polleres, et al., Knowledge graphs, 2020, arXiv preprint arXiv:2003.02320.
[2] Wanli Li, Tieyun Qian, Ming Zhong, Xu Chen, Interactive lexical and semantic graphs for semisupervised relation extraction, IEEE Trans. Neural Netw. Learn. Syst. (2022).
[3] Ming Zhong, Yingyi Zheng, Guotong Xue, Mengchi Liu, Reliable keyword query interpretation on summary graphs, IEEE Trans. Knowl. Data Eng. (2022).
[4] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al., Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia, Semant. Web 6 (2) (2015) 167–195.
[5] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, Jamie Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, in: ACM SIGMOD International Conference on Management of Data, 2008, pp. 1247–1250.
[6] George A. Miller, WordNet: a lexical database for English, Commun. ACM 38 (11) (1995) 39–41.
[7] Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum, Yago: a core of semantic knowledge, in: WWW, 2007, pp. 697–706.
[8] Paolo Rosso, Dingqi Yang, Philippe Cudré-Mauroux, Beyond triplets: hyper-relational knowledge graph embedding for link prediction, in: The Web Conference, 2020, pp. 1885–1896.
[9] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, Wei Zhang, Knowledge vault: A web-scale approach to probabilistic knowledge fusion, in: ACM SIGKDD KDD, 2014, pp. 601–610.
[10] Antoine Bordes, Xavier Glorot, Jason Weston, Yoshua Bengio, A semantic matching energy function for learning with multi-relational data, Mach. Learn. 94 (2) (2014) 233–259.
[11] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko, Translating embeddings for modeling multi-relational data, in: NIPS, 2013, pp. 1–9.
[12] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, Xuan Zhu, Learning entity and relation embeddings for knowledge graph completion, in: AAAI, Vol. 29, 2015.
[13] Maximilian Nickel, Volker Tresp, Hans-Peter Kriegel, A three-way model for collective learning on multi-relational data, in: ICML, 2011.
[14] Richard Socher, Danqi Chen, Christopher D. Manning, Andrew Ng, Reasoning with neural tensor networks for knowledge base completion, in: NIPS, Citeseer, 2013, pp. 926–934.
[15] Zhen Wang, Jianwen Zhang, Jianlin Feng, Zheng Chen, Knowledge graph embedding by translating on hyperplanes, in: AAAI, Vol. 28, 2014.
[16] Quan Wang, Zhendong Mao, Bin Wang, Li Guo, Knowledge graph embedding: A survey of approaches and applications, TKDE 29 (12) (2017) 2724–2743.
[17] Genet Asefa Gesese, Russa Biswas, Mehwish Alam, Harald Sack, A survey on knowledge graph embeddings with literals: Which model links better literal-ly?, 2019.
[18] Andrea Rossi, Denilson Barbosa, Donatella Firmani, Antonio Matinata, [47] Rudolf Kadlec, Ondrej Bajgar, Jan Kleindienst, Knowledge base com-
Paolo Merialdo, Knowledge graph embedding for link prediction: A pletion: Baselines strike back, 2017, arXiv preprint arXiv:1705.
comparative analysis, TKDD 15 (2) (2021) 1–49. 10744.
[19] Mayank Kejriwal, Advanced Topic: Knowledge Graph Completion, 2019. [48] Hitoshi Manabe, Katsuhiko Hayashi, Masashi Shimbo, Data-dependent
[20] Dat Quoc Nguyen, An overview of embedding models of entities and learning of symmetric/antisymmetric relations for knowledge base
relationships for knowledge base completion, 2017. completion, in: AAAI, Vol. 32, 2018.
[21] Hongyun Cai, Vincent W. Zheng, Kevin Chen-Chuan Chang, A comprehen- [49] Boyang Ding, Quan Wang, Bin Wang, Li Guo, Improving knowledge
sive survey of graph embedding: Problems, techniques, and applications, graph embedding using simple constraints, 2018, arXiv preprint arXiv:
TKDE 30 (9) (2018) 1616–1637. 1805.02408.
[22] Palash Goyal, Emilio Ferrara, Graph embedding techniques, applications, [50] Timothée Lacroix, Nicolas Usunier, Guillaume Obozinski, Canonical tensor
and performance: A survey, Knowl.-Based Syst. 151 (2018) 78–94. decomposition for knowledge base completion, in: ICML, PMLR, 2018, pp.
[23] Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, Philip S. 2863–2872.
Yu, A survey on knowledge graphs: Representation, acquisition and [51] Koki Kishimoto, Katsuhiko Hayashi, Genki Akai, Masashi Shimbo, Bina-
applications, 2020, arXiv preprint arXiv:2002.00388. rized canonical polyadic decomposition for knowledge graph completion,
[24] Heiko Paulheim, Knowledge graph refinement: A survey of approaches 2019, arXiv preprint arXiv:1912.02686.
and evaluation methods, Semant. Web 8 (3) (2017) 489–508. [52] Shuai Zhang, Yi Tay, Lina Yao, Qi Liu, Quaternion knowledge graph
[25] Baoxu Shi, Tim Weninger, Open-world knowledge graph completion, in: embeddings, 2019, arXiv preprint arXiv:1904.10281.
AAAI, Vol. 32, 2018. [53] Esma Balkir, Masha Naslidnyk, Dave Palfrey, Arpit Mittal, Using pairwise
[26] Agustín Borrego, Daniel Ayala, Inma Hernández, Carlos R. Rivero, David occurrence information to improve knowledge graph completion on
Ruiz, Generating rules to filter candidate triples for their correctness large-scale datasets, 2019, arXiv preprint arXiv:1910.11583.
checking by knowledge graph completion techniques, in: KCAP, 2019, [54] Ankur Padia, Konstantinos Kalpakis, Francis Ferraro, Tim Finin, Knowledge
pp. 115–122. graph fact prediction via knowledge-enriched tensor factorization, J. Web
[27] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Semant. 59 (2019) 100497.
Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative [55] Ledyard R. Tucker, Some mathematical notes on three-mode factor
adversarial networks, 2014, arXiv preprint arXiv:1406.2661. analysis, Psychometrika 31 (3) (1966) 279–311.
[28] Liwei Cai, William Yang Wang, Kbgan: Adversarial learning for knowledge [56] Tamara G. Kolda, Brett W. Bader, Tensor decompositions and applications,
graph embeddings, 2017, arXiv preprint arXiv:1711.04071. SIAM Rev. 51 (3) (2009) 455–500.
[29] Kairong Hu, Hai Liu, Tianyong Hao, A knowledge selective adversarial [57] Richard A. Harshman, Models for analysis of asymmetrical relationships
network for link prediction in knowledge graph, in: CCF NLPCC, Springer, among N objects or stimuli, in: First Joint Meeting of the Psychometric
2019, pp. 171–183. Society and the Society of Mathematical Psychology, Hamilton, Ontario,
[30] Jinghao Niu, Zhengya Sun, Wensheng Zhang, Enhancing knowledge graph 1978, 1978.
completion with positive unlabeled learning, in: ICPR, IEEE, 2018, pp. [58] Maximilian Nickel, Lorenzo Rosasco, Tomaso Poggio, Holographic
296–301. embeddings of knowledge graphs, in: AAAI, Vol. 30, 2016.
[31] Yanjie Wang, Rainer Gemulla, Hui Li, On multi-relational link prediction [59] Richard A. Harshman, Margaret E. Lundy, PARAFAC: Parallel factor
with bilinear models, in: AAAI, Vol. 32, 2018. analysis, Comput. Statist. Data Anal. 18 (1) (1994) 39–72.
[32] Kristina Toutanova, Danqi Chen, Observed versus latent features for [60] Daniel D. Lee, H. Sebastian Seung, Learning the parts of objects by
knowledge base and text inference, in: Proceedings of the 3rd Workshop non-negative matrix factorization, Nature 401 (6755) (1999) 788–791.
on Continuous Vector Space Models and their Compositionality, 2015, pp. [61] Ruslan Salakhutdinov, Nathan Srebro, Collaborative filtering in a non-
57–66. uniform world: Learning with the weighted trace norm, 2010, arXiv
[33] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, preprint arXiv:1002.2780.
Convolutional 2d knowledge graph embeddings, in: AAAI, Vol. 32, 2018. [62] Shmuel Friedland, Lek-Heng Lim, Nuclear norm of higher-order tensors,
[34] Ke Tu, Peng Cui, Daixin Wang, Zhiqiang Zhang, Jun Zhou, Yuan Qi, Wenwu Math. Comp. 87 (311) (2018) 1255–1281.
Zhu, Conditional graph attention networks for distilling and refining [63] Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan
knowledge graphs in recommendation, in: Proceedings of the 30th Durugkar, Akshay Krishnamurthy, Alex Smola, Andrew McCallum, Go for
ACM International Conference on Information & Knowledge Management, a walk and arrive at the answer: Reasoning over paths in knowledge
2021, pp. 1834–1843. bases using reinforcement learning, 2017, arXiv preprint arXiv:1711.
[35] Baoxu Shi, Tim Weninger, Discriminative predicate path mining for fact 05851.
checking in knowledge graphs, Knowl.-Based Syst. 104 (2016) 123–133. [64] Quan Liu, Hui Jiang, Andrew Evdokimov, Zhen-Hua Ling, Xiaodan Zhu, Si
[36] Shengbin Jia, Yang Xiang, Xiaojun Chen, Kun Wang, Triple trustworthiness Wei, Yu Hu, Probabilistic reasoning via deep learning: Neural association
measurement for knowledge graph, in: The World Wide Web Conference, models, 2016, arXiv preprint arXiv:1603.07704.
2019, pp. 2865–2871. [65] Saiping Guan, Xiaolong Jin, Yuanzhuo Wang, Xueqi Cheng, Shared embed-
[37] Ivana Balažević, Carl Allen, Timothy M. Hospedales, Tucker: Tensor ding based neural networks for knowledge graph completion, in: ACM
factorization for knowledge graph completion, 2019, arXiv preprint arXiv: CIKM, 2018, pp. 247–256.
1901.09590. [66] Feihu Che, Dawei Zhang, Jianhua Tao, Mingyue Niu, Bocheng Zhao,
[38] Rodolphe Jenatton, Nicolas Le Roux, Antoine Bordes, Guillaume Obozinski, Parame: Regarding neural network parameters as relation embeddings
A latent factor model for highly multi-relational data, in: NIPS), 2012, pp. for knowledge graph completion, in: Proceedings of the AAAI Conference
3176–3184. on Artificial Intelligence, Vol. 34, 2020, pp. 2774–2781.
[39] Alberto Garcia-Duran, Antoine Bordes, Nicolas Usunier, Yves Grandvalet, [67] Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, Nilesh Agrawal, Partha
Combining two and three-way embeddings models for link prediction in Talukdar, Interacte: Improving convolution-based knowledge graph em-
knowledge bases, 2015, arXiv preprint arXiv:1506.00999. beddings by increasing feature interactions, in: AAAI, Vol. 34, 2020, pp.
[40] Hanxiao Liu, Yuexin Wu, Yiming Yang, Analogical inference for 3009–3016.
multi-relational embeddings, in: ICML, PMLR, 2017, pp. 2168–2178. [68] Tu Dinh Nguyen Dai Quoc Nguyen, Dat Quoc Nguyen, Dinh Phung, A novel
[41] Yi Tay, Anh Tuan Luu, Siu Cheung Hui, Falk Brauer, Random semantic embedding model for knowledge base completion based on convolutional
tensor ensemble for scalable knowledge graph link prediction, in: WSDM, neural network, in: NAACL-HLT, 2018, pp. 327–333.
2017, pp. 751–760. [69] Dai Quoc Nguyen, Thanh Vu, Tu Dinh Nguyen, Dat Quoc Nguyen, Dinh
[42] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, Li Deng, Embed- Phung, A capsule network-based embedding model for knowledge graph
ding entities and relations for learning and inference in knowledge bases, completion and search personalization, 2018, arXiv preprint arXiv:1808.
2014, arXiv preprint arXiv:1412.6575. 04122.
[43] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, Guillaume [70] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne Van
Bouchard, Complex embeddings for simple link prediction, in: ICML, Den Berg, Ivan Titov, Max Welling, Modeling relational data with graph
PMLR, 2016, pp. 2071–2080. convolutional networks, in: ESWC, Springer, 2018, pp. 593–607.
[44] Seyed Mehran Kazemi, David Poole, Simple embedding for link prediction [71] Chao Shang, Yun Tang, Jing Huang, Jinbo Bi, Xiaodong He, Bowen Zhou,
in knowledge graphs, 2018, arXiv preprint arXiv:1802.04868. End-to-end structure-aware convolutional networks for knowledge base
[45] ABM Moniruzzaman, Richi Nayak, Maolin Tang, Thirunavukarasu Bal- completion, in: AAAI, Vol. 33, 2019, pp. 3060–3067.
asubramaniam, Fine-grained type inference in knowledge graphs via [72] Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, Partha Talukdar,
probabilistic and tensor factorization methods, in: WWW, 2019, pp. Composition-based multi-relational graph convolutional networks, in:
3093–3100. International Conference on Learning Representations, 2019.
[46] Sameh K. Mohamed, Nová Vít, TriVec: Knowledge Graph Embeddings [73] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Janvin, A
for Accurate and Efficient Link Prediction in Real World Application neural probabilistic language model, J. Mach. Learn. Res. 3 (2003)
Scenarios. 1137–1155.
[74] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray [103] Qizhe Xie, Xuezhe Ma, Zihang Dai, Eduard H. Hovy, An interpretable
Kavukcuoglu, Pavel Kuksa, Natural language processing (almost) from knowledge transfer model for knowledge base completion, in: ACL (1),
scratch, J. Mach. Learn. Res. 12 (ARTICLE) (2011) 2493–2537. 2017.
[75] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, Efficient estimation [104] Wei Qian, Cong Fu, Yu Zhu, Deng Cai, Xiaofei He, Translating embeddings
of word representations in vector space, 2013, arXiv preprint arXiv: for knowledge graph completion with relation attention mechanism, in:
1301.3781. IJCAI, 2018, pp. 4286–4292.
[76] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, Imagenet clas- [105] Jun Yuan, Neng Gao, Ji Xiang, TransGate: knowledge graph embedding
sification with deep convolutional neural networks, NIPS 25 (2012) with shared gate structure, in: AAAI, Vol. 33, 2019, pp. 3100–3107.
1097–1105. [106] Xiaofei Zhou, Qiannan Zhu, Ping Liu, Li Guo, Learning knowledge embed-
[77] Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton, Dynamic routing dings by combining limit-based scoring loss, in: ACM on CIKM, 2017, pp.
between capsules, 2017, arXiv preprint arXiv:1710.09829. 1009–1018.
[78] Joan Bruna, Wojciech Zaremba, Arthur Szlam, Yann LeCun, Spectral [107] Mojtaba Nayyeri, Sahar Vahdati, Jens Lehmann, Hamed Shariat Yazdi, Soft
networks and locally connected networks on graphs, 2014. marginal transe for scholarly knowledge graph completion, 2019, arXiv
[79] Na Li, Zied Bouraoui, Steven Schockaert, Ontology completion using graph preprint arXiv:1904.12211.
convolutional networks, in: ISWC, Springer, 2019, pp. 435–452. [108] Han Xiao, Minlie Huang, Yu Hao, Xiaoyan Zhu, TransA: An adaptive
[80] David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael approach for knowledge graph embedding, 2015, arXiv preprint arXiv:
Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, Ryan P. Adams, 1509.05490.
Convolutional networks on graphs for learning molecular fingerprints, [109] Takuma Ebisu, Ryutaro Ichise, Toruse: Knowledge graph embedding on a
2015, arXiv preprint arXiv:1509.09292. lie group, in: AAAI, Vol. 32, 2018.
[81] Aditya Grover, Aaron Zweig, Stefano Ermon, Graphite: Iterative generative [110] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, Jian Tang, RotatE: Knowledge
modeling of graphs, in: ICML, PMLR, 2019, pp. 2434–2444. graph embedding by relational rotation in complex space, in: ICLR, 2018.
[82] Thomas N. Kipf, Max Welling, Semi-supervised classification with graph [111] Ruobing Xie, Zhiyuan Liu, Fen Lin, Leyu Lin, Does william shakespeare
convolutional networks, 2016, arXiv preprint arXiv:1609.02907. really write hamlet? knowledge representation learning with confidence,
[83] Xiangyu Song, Jianxin Li, Qi Lei, Wei Zhao, Yunliang Chen, Ajmal in: AAAI, Vol. 32, 2018.
Mian, Bi-CLKT: Bi-graph contrastive learning based knowledge tracing, [112] Richong Zhang, Junpeng Li, Jiajie Mei, Yongyi Mao, Scalable instance
Knowl.-Based Syst. 241 (2022) 108274. reconstruction in knowledge bases via relatedness affiliated embedding,
[84] Xiangyu Song, Jianxin Li, Yifu Tang, Taige Zhao, Yunliang Chen, Ziyu Guan, in: WWW, 2018, pp. 1185–1194.
Jkt: A joint graph convolutional network based deep knowledge tracing, [113] Antoine Bordes, Jason Weston, Ronan Collobert, Yoshua Bengio, Learning
Inform. Sci. 580 (2021) 510–523. structured embeddings of knowledge bases, in: AAAI, Vol. 25, 2011.
[85] Albert T. Corbett, John R. Anderson, Knowledge tracing: Modeling the
[114] Ziqi Zhang, Effective and efficient semantic table interpretation using
acquisition of procedural knowledge, User Model. User-Adapt. Interact. 4
tableminer+, Semant. Web 8 (6) (2017) 921–957.
(4) (1994) 253–278.
[115] Yun Tang, Jing Huang, Guangtao Wang, Xiaodong He, Bowen Zhou, Or-
[86] Yaming Yang, Ziyu Guan, Jianxin Li, Wei Zhao, Jiangtao Cui, Quan Wang,
thogonal relation transforms with graph context modeling for knowledge
Interpretable and efficient heterogeneous graph convolutional network,
graph embedding, 2019, arXiv preprint arXiv:1911.04910.
IEEE Trans. Knowl. Data Eng. (2021).
[116] Nitin Bansal, Xiaohan Chen, Zhangyang Wang, Can we gain more from
[87] Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, Tianyi Wu, Pathsim:
orthogonality regularizations in training deep networks? Adv. Neural Inf.
Meta path-based top-k similarity search in heterogeneous information
Process. Syst. 31 (2018) 4261–4271.
networks, Proc. VLDB Endow. 4 (11) (2011) 992–1003.
[117] Shengwu Xiong, Weitao Huang, Pengfei Duan, Knowledge graph embed-
[88] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
ding via relation paths and dynamic mapping matrix, in: International
Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative
Conference on Conceptual Modeling, Springer, 2018, pp. 106–118.
adversarial networks, 2014, arXiv preprint arXiv:1406.2661.
[118] Alberto Garcia-Duran, Mathias Niepert, Kblrn: End-to-end learning of
[89] Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu, Seqgan: Sequence
knowledge base representations with latent, relational, and numerical
generative adversarial nets with policy gradient, in: AAAI, Vol. 31, 2017.
features, 2017, arXiv preprint arXiv:1709.04676.
[90] Zhiqing Sun, Shikhar Vashishth, Soumya Sanyal, Partha Talukdar, Yiming
[119] T. Yi, L.A. Tuan, M.C. Phan, S.C. Hui, Multi-task neural network for
Yang, A re-evaluation of knowledge graph completion methods, 2019,
non-discrete attribute prediction in knowledge graphs, in: CIKM’17, 2017.
arXiv preprint arXiv:1911.03903.
[120] Yanrong Wu, Zhichun Wang, Knowledge graph embedding with numeric
[91] Deepak Nathani, Jatin Chauhan, Charu Sharma, Manohar Kaul, Learning
attributes of entities, in: Workshop on RepL4NLP, 2018, pp. 132–136.
attention-based embeddings for relation prediction in knowledge graphs,
in: ACL, 2019, pp. 4710–4723. [121] Huaping Zhong, Jianwen Zhang, Zhen Wang, Hai Wan, Zheng Chen,
[92] Jeffrey Pennington, Richard Socher, Christopher D. Manning, Glove: Global Aligning knowledge and text embeddings by entity descriptions, in:
vectors for word representation, in: EMNLP, 2014, pp. 1532–1543. EMNLP, 2015, pp. 267–272.
[93] Byungkook Oh, Seungmin Seo, Kyong-Ho Lee, Knowledge graph [122] Ruobing Xie, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Image-embodied
completion by context-aware convolutional learning with multi-hop knowledge representation learning, 2016, arXiv preprint arXiv:1609.
neighborhoods, in: ACM CIKM, 2018, pp. 257–266. 07028.
[94] Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, Maosong Sun, Represen- [123] Hatem Mousselly-Sergieh, Teresa Botschen, Iryna Gurevych, Stefan
tation learning of knowledge graphs with entity descriptions, in: AAAI, Roth, A multimodal translation-based approach for knowledge graph
Vol. 30, 2016. representation learning, in: SEM, 2018, pp. 225–234.
[95] Minjun Zhao, Yawei Zhao, Bing Xu, Knowledge graph completion via [124] Pouya Pezeshkpour, Liyan Chen, Sameer Singh, Embedding multimodal
complete attention between knowledge graph and entity descriptions, relational data for knowledge base completion, in: EMNLP, 2018.
in: CSAE, 2019, pp. 1–6. [125] Ye Liu, Hui Li, Alberto Garcia-Duran, Mathias Niepert, Daniel Onoro-Rubio,
[96] Tehseen Zia, Usman Zahid, David Windridge, A generative adversarial David S. Rosenblum, MMKG: multi-modal knowledge graphs, in: ESWC,
strategy for modeling relation paths in knowledge base representation Springer, 2019, pp. 459–474.
learning, 2019. [126] Agustinus Kristiadi, Mohammad Asif Khan, Denis Lukovnikov, Jens
[97] Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, Jun Zhao, Knowledge graph Lehmann, Asja Fischer, Incorporating literals into knowledge graph
embedding via dynamic mapping matrix, in: ACL and IJCNLP (Volume 1: embeddings, in: ISWC, Springer, 2019, pp. 347–363.
Long Papers), 2015, pp. 687–696. [127] Kai-Wei Chang, Wen-tau Yih, Bishan Yang, Christopher Meek, Typed
[98] Hee-Geun Yoon, Hyun-Je Song, Seong-Bae Park, Se-Young Park, A tensor decomposition of knowledge bases for relation extraction, in:
translation-based knowledge graph embedding preserving logical prop- EMNLP, 2014, pp. 1568–1579.
erty of relations, in: NAACL: Human Language Technologies, 2016, pp. [128] Denis Krompaß, Stephan Baier, Volker Tresp, Type-constrained repre-
907–916. sentation learning in knowledge graphs, in: ISWC, Springer, 2015, pp.
[99] Kien Do, Truyen Tran, Svetha Venkatesh, Knowledge graph embedding 640–655.
with multiple relation projections, in: ICPR, IEEE, 2018, pp. 332–337. [129] Shiheng Ma, Jianhui Ding, Weijia Jia, Kun Wang, Minyi Guo, Transt:
[100] Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, Mark Johnson, STransE: a novel Type-based multiple embedding representations for knowledge graph
embedding model of entities and relationships in knowledge bases, in: completion, in: ECML PKDD, Springer, 2017, pp. 717–733.
HLT-NAACL, 2016. [130] Alexandros Komninos, Suresh Manandhar, Feature-rich networks for
[101] Jun Feng, Minlie Huang, Mingdong Wang, Mantong Zhou, Yu Hao, Xiaoyan knowledge base completion, in: ACL (Volume 2: Short Papers), 2017, pp.
Zhu, Knowledge graph embedding by flexible translation, in: KR, 2016, 324–329.
pp. 557–560. [131] Elvira Amador-Domínguez, Patrick Hohenecker, Thomas Lukasiewicz,
[102] Miao Fan, Qiang Zhou, Emily Chang, Fang Zheng, Transition-based knowl- Daniel Manrique, Emilio Serrano, An ontology-based deep learning ap-
edge graph embedding with relational mapping properties, in: PACLIC, proach for knowledge graph completion with fresh entities, in: DCAI,
2014, pp. 328–337. Springer, 2019, pp. 125–133.
[132] Zhiting Hu, Poyao Huang, Yuntian Deng, Yingkai Gao, Eric Xing, Entity hierarchy embedding, in: ACL and IJCNLP (Volume 1: Long Papers), 2015, pp. 1292–1300.
[133] Shu Guo, Quan Wang, Bin Wang, Lihong Wang, Li Guo, Semantically smooth knowledge graph embedding, in: ACL and IJCNLP (Volume 1: Long Papers), 2015, pp. 84–94.
[134] Jianxin Ma, Peng Cui, Xiao Wang, Wenwu Zhu, Hierarchical taxonomy aware network embedding, in: ACM SIGKDD KDD, 2018, pp. 1920–1929.
[135] Hanie Sedghi, Ashish Sabharwal, Knowledge completion for generics using guided tensor factorization, Trans. Assoc. Comput. Linguist. 6 (2018) 197–210.
[136] Bahare Fatemi, Siamak Ravanbakhsh, David Poole, Improved knowledge graph embedding using background taxonomic information, in: AAAI, Vol. 33, 2019, pp. 3526–3533.
[137] Mikhail Belkin, Partha Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: NIPS, Vol. 14, 2001, pp. 585–591.
[138] Sam T. Roweis, Lawrence K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[139] Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, Knowledge graph completion with adaptive sparse transfer matrix, in: AAAI, Vol. 30, 2016.
[140] Zhiqiang Geng, Zhongkun Li, Yongming Han, A novel asymmetric embedding model for knowledge graph completion, in: ICPR, IEEE, 2018, pp. 290–295.
[141] Muhao Chen, Yingtao Tian, Xuelu Chen, Zijun Xue, Carlo Zaniolo, On2vec: Embedding-based relation prediction for ontology population, in: SIAM ICDM, SIAM, 2018, pp. 315–323.
[142] Ryo Takahashi, Ran Tian, Kentaro Inui, Interpretable and compositional relation learning by joint training with an autoencoder, in: ACL (Volume 1: Long Papers), 2018, pp. 2148–2159.
[143] Kelvin Guu, John Miller, Percy Liang, Traversing knowledge graphs in vector space, 2015, arXiv preprint arXiv:1506.01094.
[144] Atsushi Suzuki, Yosuke Enokida, Kenji Yamanishi, Riemannian TransE: Multi-relational graph embedding in non-euclidean space, 2018.
[145] Zili Zhou, Shaowu Liu, Guandong Xu, Wu Zhang, On completing sparse knowledge base with transitive relation embedding, in: AAAI, Vol. 33, 2019, pp. 3125–3132.
[146] Charalampos E. Tsourakakis, Fast counting of triangles in large real networks without counting: Algorithms and laws, in: ICDM, IEEE, 2008, pp. 608–617.
[147] Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, Mark Johnson, Neighborhood mixture model for knowledge base completion, 2016, arXiv preprint arXiv:1606.06461.
[148] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, Yoshua Bengio, Graph attention networks, 2017, arXiv preprint arXiv:1710.10903.
[149] Fanshuang Kong, Richong Zhang, Yongyi Mao, Ting Deng, Lena: Locality-expanded neural embedding for knowledge base completion, in: AAAI, Vol. 33, 2019, pp. 2895–2902.
[150] Trapit Bansal, Da-Cheng Juan, Sujith Ravi, Andrew McCallum, A2n: Attending to neighbors for knowledge graph inference, in: ACL, 2019, pp. 4387–4392.
[151] Peifeng Wang, Jialong Han, Chenliang Li, Rong Pan, Logic attention based neighborhood aggregation for inductive knowledge graph embedding, in: AAAI, Vol. 33, 2019, pp. 7152–7159.
[152] Weidong Li, Xinyu Zhang, Yaqian Wang, Zhihuan Yan, Rong Peng, Graph2Seq: Fusion embedding learning for knowledge graph completion, IEEE Access 7 (2019) 157960–157971.
[153] Zhao Zhang, Fuzhen Zhuang, Hengshu Zhu, Zhiping Shi, Hui Xiong, Qing He, Relational graph neural network with hierarchical attention for knowledge graph completion, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 9612–9619.
[154] Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, William Yang Wang, One-shot relational learning for knowledge graphs, 2018, arXiv preprint arXiv:1808.09040.
[155] Jiatao Zhang, Tianxing Wu, Guilin Qi, Gaussian metric learning for few-shot uncertain knowledge graph completion, in: International Conference on Database Systems for Advanced Applications, Springer, 2021, pp. 256–271.
[156] Sébastien Ferré, Link prediction in knowledge graphs with concepts of nearest neighbours, in: ESWC, Springer, 2019, pp. 84–100.
[157] Agustín Borrego, Daniel Ayala, Inma Hernández, Carlos R. Rivero, David Ruiz, CAFE: Knowledge graph completion using neighborhood-aware features, Eng. Appl. Artif. Intell. 103 (2021) 104302.
[158] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus, End-to-end memory networks, 2015, arXiv preprint arXiv:1503.08895.
[159] Sébastien Ferré, Concepts de plus proches voisins dans des graphes de connaissances, in: 28es Journées Francophones d'Ingénierie des Connaissances IC 2017, 2017, pp. 163–174.
[160] T. Denoeux, A k-nearest neighbor classification rule based on Dempster-Shafer theory, IEEE Trans. Syst. Man Cybern. 25 (5) (1995) 804–813.
[161] Antoine Bordes, Xavier Glorot, Jason Weston, Yoshua Bengio, Joint learning of words and meaning representations for open-text semantic parsing, in: Artificial Intelligence and Statistics, PMLR, 2012, pp. 127–135.
[162] Yashen Wang, Yifeng Liu, Huanhuan Zhang, Haiyong Xie, Leveraging lexical semantic information for learning concept-based multiple embedding representations for knowledge graph completion, in: APWeb and WAIM Joint International Conference on Web and Big Data, Springer, 2019, pp. 382–397.
[163] Wenpeng Yin, Yadollah Yaghoobzadeh, Hinrich Schütze, Recurrent one-hop predictions for reasoning over knowledge graphs, in: COLING, 2018, pp. 2369–2378.
[164] Ni Lao, Tom Mitchell, William Cohen, Random walk inference and learning in a large scale knowledge base, in: EMNLP, 2011, pp. 529–539.
[165] Matt Gardner, Tom Mitchell, Efficient and expressive knowledge base completion using subgraph feature extraction, in: EMNLP, 2015, pp. 1488–1498.
[166] Arvind Neelakantan, Benjamin Roth, Andrew McCallum, Compositional vector space models for knowledge base completion, in: ACL and IJCNLP (Volume 1: Long Papers), 2015, pp. 156–166.
[167] Rajarshi Das, Arvind Neelakantan, David Belanger, Andrew McCallum, Chains of reasoning over entities, relations, and text using recurrent neural networks, in: EACL (1), 2017.
[168] Xiaotian Jiang, Quan Wang, Baoyuan Qi, Yongqin Qiu, Peng Li, Bin Wang, Attentive path combination for knowledge graph completion, in: ACML, PMLR, 2017, pp. 590–605.
[169] Yelong Shen, Po-Sen Huang, Ming-Wei Chang, Jianfeng Gao, Modeling large-scale structured relationships with shared memory for knowledge base completion, in: Workshop on Representation Learning for NLP, 2017, pp. 57–68.
[170] Kai Lei, Jin Zhang, Yuexiang Xie, Desi Wen, Daoyuan Chen, Min Yang, Ying Shen, Path-based reasoning with constrained type attention for knowledge graph completion, Neural Comput. Appl. (2019) 1–10.
[171] Kristina Toutanova, Xi Victoria Lin, Wen-tau Yih, Hoifung Poon, Chris Quirk, Compositional learning of embeddings for relation paths in knowledge base and text, in: ACL (Volume 1: Long Papers), 2016, pp. 1434–1444.
[172] Xixun Lin, Yanchun Liang, Fausto Giunchiglia, Xiaoyue Feng, Renchu Guan, Relation path embedding in knowledge graphs, Neural Comput. Appl. 31 (9) (2019) 5629–5639.
[173] Vivi Nastase, Bhushan Kotnis, Abstract graphs and abstract paths for knowledge graph completion, in: SEM, 2019, pp. 147–157.
[174] Yao Zhu, Hongzhi Liu, Zhonghai Wu, Yang Song, Tao Zhang, Representation learning with ordered relation paths for knowledge graph completion, 2019, arXiv preprint arXiv:1909.11864.
[175] Batselem Jagvaral, Wan-Kon Lee, Jae-Seung Roh, Min-Sung Kim, Young-Tack Park, Path-based reasoning approach for knowledge graph completion using CNN-BiLSTM with attention mechanism, Expert Syst. Appl. 142 (2020) 112960.
[176] Tao Zhou, Jie Ren, Matúš Medo, Yi-Cheng Zhang, Bipartite network projection and personal recommendation, Phys. Rev. E 76 (4) (2007) 046115.
[177] Matt Gardner, Partha Talukdar, Bryan Kisiel, Tom Mitchell, Improving learning and inference in a large knowledge-base using latent syntactic cues, in: EMNLP, 2013, pp. 833–838.
[178] Matt Gardner, Partha Talukdar, Jayant Krishnamurthy, Tom Mitchell, Incorporating vector space similarity in random walk inference over knowledge bases, in: EMNLP, 2014, pp. 397–406.
[179] Paul J. Werbos, Backpropagation through time: what it does and how to do it, Proc. IEEE 78 (10) (1990) 1550–1560.
[180] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, in: EMNLP, 2014.
[181] Dave Orr, Amar Subramanya, Evgeniy Gabrilovich, Michael Ringgaard, 11 billion clues in 800 million documents: A web research corpus annotated with freebase concepts, Google Research Blog.
[182] Yadollah Yaghoobzadeh, Hinrich Schütze, Corpus-level fine-grained entity typing using contextual information, 2016, arXiv preprint arXiv:1606.07901.
[183] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention is all you need, in: NIPS, 2017.
[184] Ali Sadeghian, Mohammadreza Armandpour, Patrick Ding, Daisy Zhe Wang, Drum: End-to-end differentiable rule mining on knowledge graphs, 2019, arXiv preprint arXiv:1911.00055.
[185] Quan Wang, Bin Wang, Li Guo, Knowledge base completion using embeddings and rules, in: IJCAI, 2015.
[186] Shangpu Jiang, Daniel Lowd, Dejing Dou, Learning to refine an automatically extracted knowledge base using markov logic, in: ICDM, IEEE, 2012, pp. 912–917.
[187] Jay Pujara, Hui Miao, Lise Getoor, William W. Cohen, Ontology-aware partitioning for knowledge graph identification, in: AKBC Workshop, 2013, pp. 19–24.
[188] Gustav Sourek, Vojtech Aschenbrenner, Filip Zelezny, Steven Schockaert, Ondrej Kuzelka, Lifted relational neural networks: Efficient learning of latent relational structures, J. Artificial Intelligence Res. 62 (2018) 69–100.
[189] Ondřej Kuželka, Jesse Davis, Markov logic networks for knowledge base completion: A theoretical analysis under the MCAR assumption, in: UAI, PMLR, 2020, pp. 1138–1148.
[190] Yuyu Zhang, Xinshi Chen, Yuan Yang, Arun Ramamurthy, Bo Li, Yuan Qi, Le Song, Efficient probabilistic logic reasoning with graph neural networks, 2020, arXiv preprint arXiv:2001.11850.
[191] Fan Yang, Zhilin Yang, William W. Cohen, Differentiable learning of logical rules for knowledge base reasoning, 2017, arXiv preprint arXiv:1702.08367.
[192] Pouya Ghiasnezhad Omran, Kewen Wang, Zhe Wang, Scalable rule learning via learning representation, in: IJCAI, 2018, pp. 2149–2155.
[193] Tim Rocktäschel, Deep prolog: End-to-end differentiable proving in knowledge bases, in: AITP 2017, 2017, p. 9.
[194] Pasquale Minervini, Matko Bosnjak, Tim Rocktäschel, Sebastian Riedel, Towards neural theorem proving at scale, 2018, arXiv preprint arXiv:1807.08204.
[195] Zhuoyu Wei, Jun Zhao, Kang Liu, Zhenyu Qi, Zhengya Sun, Guanhua Tian, Large-scale knowledge base completion: Inferring via grounding network sampling over selected instances, in: CIKM, 2015, pp. 1331–1340.
[196] William Yang Wang, William W. Cohen, Learning first-order logic embeddings via matrix factorization, in: IJCAI, 2016, pp. 2132–2138.
[197] Shu Guo, Quan Wang, Lihong Wang, Bin Wang, Li Guo, Jointly embedding knowledge graphs and logical rules, in: EMNLP, 2016, pp. 192–202.
[198] Pengwei Wang, Dejing Dou, Fangzhao Wu, Nisansa de Silva, Lianwen Jin, Logic rules powered knowledge graph embedding, 2019, arXiv preprint arXiv:1903.03772.
[199] Shu Guo, Quan Wang, Lihong Wang, Bin Wang, Li Guo, Knowledge graph embedding with iterative guidance from soft rules, in: AAAI, Vol. 32, 2018.
[200] Vinh Thinh Ho, Daria Stepanova, Mohamed Hassan Gad-Elrab, Evgeny Kharlamov, Gerhard Weikum, Learning rules from incomplete kgs using embeddings, in: ISWC, CEUR-WS.org, 2018.
[201] Wen Zhang, Bibek Paudel, Liang Wang, Jiaoyan Chen, Hai Zhu, Wei Zhang, Abraham Bernstein, Huajun Chen, Iteratively learning embeddings and rules for knowledge graph reasoning, in: WWW, 2019, pp. 2366–2377.
[202] Meng Qu, Jian Tang, Probabilistic logic neural networks for reasoning, 2019, arXiv preprint arXiv:1906.08495.
[203] Jianfeng Du, Jeff Z. Pan, Sylvia Wang, Kunxun Qi, Yuming Shen, Yu Deng, Validation of growing knowledge graphs by abductive text evidences, in: AAAI, Vol. 33, 2019, pp. 2784–2791.
[204] Christian Meilicke, Melisachew Wudage Chekol, Daniel Ruffinelli, Heiner Stuckenschmidt, Anytime bottom-up rule learning for knowledge graph completion, in: IJCAI, 2019, pp. 3137–3143.
[205] Jiangtao Ma, Yaqiong Qiao, Guangwu Hu, Yanjun Wang, Chaoqin Zhang, Yongzhong Huang, Arun Kumar Sangaiah, Huaiguang Wu, Hongpo Zhang, Kai Ren, ELPKG: A high-accuracy link prediction approach for knowledge graph completion, Symmetry 11 (9) (2019) 1096.
[206] Guanglin Niu, Yongfei Zhang, Bo Li, Peng Cui, Si Liu, Jingyang Li, Xiaowei Zhang, Rule-guided compositional representation learning on knowledge graphs, in: AAAI, Vol. 34, 2020, pp. 2950–2958.
[207] Luis Galárraga, Christina Teflioudi, Katja Hose, Fabian M. Suchanek, Fast rule mining in ontological knowledge bases with AMIE+, VLDB J. 24 (6) (2015) 707–730.
[208] Christian Meilicke, Manuel Fink, Yanjie Wang, Daniel Ruffinelli, Rainer Gemulla, Heiner Stuckenschmidt, Fine-grained evaluation of rule- and embedding-based systems for knowledge graph completion, in: ISWC, Springer, 2018, pp. 3–20.
[209] Yang Chen, Sean Goldberg, Daisy Zhe Wang, Soumitra Siddharth Johri, Ontological pathfinding, in: International Conference on Management of Data, 2016, pp. 835–846.
[210] Tim Rocktäschel, Matko Bosnjak, Sameer Singh, Sebastian Riedel, Low-dimensional embeddings of logic, in: ACL 2014 Workshop on Semantic Parsing, 2014, pp. 45–49.
[211] Tim Rocktäschel, Sameer Singh, Sebastian Riedel, Injecting logical background knowledge into embeddings for relation extraction, in: NAACL: Human Language Technologies, 2015, pp. 1119–1129.
[212] Stephen Muggleton, et al., Stochastic logic programs, in: Advances in Inductive Logic Programming, Vol. 32, Citeseer, 1996, pp. 254–264.
[213] Stephen Muggleton, Inductive Logic Programming, Vol. 38, Morgan Kaufmann, 1992.
[214] Daphne Koller, Nir Friedman, Sašo Džeroski, Charles Sutton, Andrew McCallum, Avi Pfeffer, Pieter Abbeel, Ming-Fai Wong, Chris Meek, Jennifer Neville, et al., Introduction to Statistical Relational Learning, MIT Press, 2007.
[215] Matthew Richardson, Pedro Domingos, Markov logic networks, Mach. Learn. 62 (1–2) (2006) 107–136.
[216] William Yang Wang, Kathryn Mazaitis, William W. Cohen, Programming with personalized pagerank: a locally groundable first-order probabilistic logic, in: CIKM, 2013, pp. 2129–2138.
[217] Arvind Neelakantan, Quoc V. Le, Martin Abadi, Andrew McCallum, Dario Amodei, Learning a natural language interface with neural programmer, 2016, arXiv preprint arXiv:1611.08945.
[218] Arvind Neelakantan, Quoc V. Le, Ilya Sutskever, Neural programmer: Inducing latent programs with gradient descent, 2015, arXiv preprint arXiv:1511.04834.
[219] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein, Learning to compose neural networks for question answering, 2016, arXiv preprint arXiv:1601.01705.
[220] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al., Hybrid computing using a neural network with dynamic external memory, Nature 538 (7626) (2016) 471–476.
[221] William W. Cohen, Tensorlog: A differentiable deductive database, 2016, arXiv preprint arXiv:1605.06523.
[222] Leonid Boytsov, Bilegsaikhan Naidan, Engineering efficient and effective non-metric space library, in: SISAP, Springer, 2013, pp. 280–293.
[223] Yu A. Malkov, Dmitry A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, PAMI 42 (4) (2018) 824–836.
[224] Richard Evans, Edward Grefenstette, Learning explanatory rules from noisy data, J. Artificial Intelligence Res. 61 (2018) 1–64.
[225] Guillaume Bouchard, Sameer Singh, Theo Trouillon, On approximate reasoning capabilities of low-rank vector spaces, in: AAAI Spring Symposia, Citeseer, 2015.
[226] Baoxu Shi, Tim Weninger, ProjE: Embedding projection for knowledge graph completion, in: AAAI, Vol. 31, 2017.
[227] Han Xiao, Minlie Huang, Lian Meng, Xiaoyan Zhu, SSP: semantic space projection for knowledge graph embedding with text descriptions, in: AAAI, Vol. 31, 2017.
[228] Xu Han, Zhiyuan Liu, Maosong Sun, Neural knowledge acquisition via mutual attention between knowledge graph and text, in: AAAI, Vol. 32, 2018.
[229] Paolo Rosso, Dingqi Yang, Philippe Cudré-Mauroux, Revisiting text and knowledge graph joint embeddings: The amount of shared information matters!, in: 2019 IEEE Big Data, IEEE, 2019, pp. 2465–2473.
[230] Bo An, Bo Chen, Xianpei Han, Le Sun, Accurate text-enhanced knowledge graph representation learning, in: NAACL: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 745–755.
[231] Teng Long, Ryan Lowe, Jackie Chi Kit Cheung, Doina Precup, Leveraging lexical resources for learning entity embeddings in multi-relational data, in: ACL (2), 2016.
[232] Miao Fan, Qiang Zhou, Thomas Fang Zheng, Ralph Grishman, Distributed representation learning for knowledge graphs with entity descriptions, Pattern Recognit. Lett. 93 (2017) 31–37.
[233] Jiacheng Xu, Xipeng Qiu, Kan Chen, Xuanjing Huang, Knowledge graph representation with jointly structural and textual encoding, in: IJCAI, 2017.
[234] Michael Cochez, Martina Garofalo, Jérôme Lenßen, Maria Angela Pellegrino, A first experiment on including text literals in KGloVe, 2018, arXiv preprint arXiv:1807.11761.
[235] Nada Mimouni, Jean-Claude Moissinac, Anh Vu, Knowledge base completion with analogical inference on context graphs, in: Semapro 2019, 2019.
[236] Liang Yao, Chengsheng Mao, Yuan Luo, KG-BERT: BERT for knowledge graph completion, 2019, arXiv preprint arXiv:1909.03193.
[237] Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, Jian Tang, KEPLER: A unified model for knowledge embedding and pre-trained language representation, 2019, arXiv preprint arXiv:1911.06136.
[238] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, Roberta: A robustly optimized bert pretraining approach, 2019, arXiv preprint arXiv:1907.11692.
[239] Daniel Daza, Michael Cochez, Paul Groth, Inductive entity representations from text via link prediction, in: Proceedings of the Web Conference 2021, 2021, pp. 798–808.
[240] Bo Wang, Tao Shen, Guodong Long, Tianyi Zhou, Ying Wang, Yi Chang, Structure-augmented text representation learning for efficient knowledge graph completion, in: Proceedings of the Web Conference 2021, 2021, pp. 1737–1748.
[241] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer, Deep contextualized word representations, 2018, arXiv preprint arXiv:1802.05365.
[242] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, Improving language understanding by generative pre-training, 2018.
[243] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805.
[244] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le, Xlnet: Generalized autoregressive pretraining for language understanding, 2019, arXiv preprint arXiv:1906.08237.
[245] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean, Distributed representations of words and phrases and their compositionality, 2013, arXiv preprint arXiv:1310.4546.
[246] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention is all you need, 2017, arXiv preprint arXiv:1706.03762.
[247] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, Qun Liu, ERNIE: Enhanced language representation with informative entities, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1441–1451.
[248] Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuan-Jing Huang, Zheng Zhang, CoLAKE: Contextualized language and knowledge embedding, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 3660–3670.
[249] Boran Hao, Henghui Zhu, Ioannis Paschalidis, Enhancing clinical bert embedding using a biomedical knowledge base, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 657–661.
[250] Robert Logan, Nelson F. Liu, Matthew E. Peters, Matt Gardner, Sameer Singh, Barack's wife hillary: Using knowledge graphs for fact-aware language modeling, in: ACL, 2019, pp. 5962–5971.
[251] Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, Noah A. Smith, Knowledge enhanced contextual word representations, in: EMNLP-IJCNLP, 2019, pp. 43–54.
[252] Tingsong Jiang, Tianyu Liu, Tao Ge, Lei Sha, Sujian Li, Baobao Chang, Zhifang Sui, Encoding temporal information for time-aware link prediction, in: EMNLP, 2016, pp. 2350–2354.
[253] Rishab Goel, Seyed Mehran Kazemi, Marcus Brubaker, Pascal Poupart, Diachronic embedding for temporal knowledge graph completion, in: AAAI, Vol. 34, 2020, pp. 3988–3995.
[254] Chenjin Xu, Mojtaba Nayyeri, Fouad Alkhoury, Hamed Yazdi, Jens Lehmann, Temporal knowledge graph completion based on time series Gaussian embedding, in: ISWC, Springer, 2020, pp. 654–671.
[255] Julien Leblay, Melisachew Wudage Chekol, Deriving validity time in knowledge graph, in: Companion Proceedings of the Web Conference 2018, 2018, pp. 1771–1776.
[256] Shib Sankar Dasgupta, Swayambhu Nath Ray, Partha Talukdar, Hyte: Hyperplane-based temporally aware knowledge graph embedding, in: EMNLP, 2018, pp. 2001–2011.
[257] Yunpu Ma, Volker Tresp, Erik A. Daxberger, Embedding models for episodic knowledge graphs, J. Web Semant. 59 (2019) 100490.
[258] Alberto García-Durán, Sebastijan Dumančić, Mathias Niepert, Learning sequence encoders for temporal knowledge graph completion, 2018, arXiv preprint arXiv:1809.03202.
[259] Timothée Lacroix, Guillaume Obozinski, Nicolas Usunier, Tensor decompositions for temporal knowledge base completion, 2020, arXiv preprint arXiv:2004.04926.
[260] Rakshit Trivedi, Hanjun Dai, Yichen Wang, Le Song, Know-evolve: Deep temporal reasoning for dynamic knowledge graphs, in: ICML, PMLR, 2017, pp. 3462–3471.
[261] Woojeong Jin, He Jiang, Meng Qu, Tong Chen, Changlin Zhang, Pedro Szekely, Xiang Ren, Recurrent event network: Global structure inference over temporal knowledge graph, 2019, arXiv preprint arXiv:1904.05530.
[262] Zhen Han, Yuyi Wang, Yunpu Ma, Stephan Günnemann, Volker Tresp, The graph hawkes network for reasoning on temporal knowledge graphs, 2020, arXiv preprint arXiv:2003.13432.
[263] Jiapeng Wu, Meng Cao, Jackie Chi Kit Cheung, William L. Hamilton, TeMP: Temporal Message Passing for Temporal Knowledge Graph Completion, 2020, arXiv preprint arXiv:2010.03526.
[264] Michael D. Ward, Andreas Beger, Josh Cutler, Matthew Dickenson, Cassy Dorff, Ben Radford, Comparing GDELT and ICEWS event data, Analysis 21 (1) (2013) 267–297.
[265] Aaron Schein, John Paisley, David M. Blei, Hanna Wallach, Bayesian poisson tensor factorization for inferring multilateral relations from sparse dyadic event counts, in: ACM SIGKDD KDD, 2015, pp. 1045–1054.
[266] Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, Gerhard Weikum, YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia, Artificial Intelligence 194 (2013) 28–61.
[267] Farzaneh Mahdisoltani, Joanna Biega, Fabian Suchanek, Yago3: A knowledge base from multilingual wikipedias, in: CIDR, CIDR Conference, 2014.
[268] Itsumi Saito, Kyosuke Nishida, Hisako Asano, Junji Tomita, Commonsense knowledge base completion and generation, in: CoNLL, 2018, pp. 141–150.
[269] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Vqa: Visual question answering, in: ICCV, 2015, pp. 2425–2433.
[270] Andrej Karpathy, Li Fei-Fei, Deep visual-semantic alignments for generating image descriptions, in: CVPR, 2015, pp. 3128–3137.
[271] Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, Jason Weston, Engaging image captioning via personality, in: IEEE/CVF CVPR, 2019, pp. 12516–12526.
[272] Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, Richard Socher, Explain yourself! leveraging language models for commonsense reasoning, 2019, arXiv preprint arXiv:1906.02361.
[273] Pengcheng Yang, Lei Li, Fuli Luo, Tianyu Liu, Xu Sun, Enhancing topic-to-essay generation with external commonsense knowledge, in: ACL, 2019, pp. 2002–2012.
[274] Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, Minlie Huang, Augmenting end-to-end dialogue systems with commonsense knowledge, in: AAAI, Vol. 32, 2018.
[275] Xiang Li, Aynaz Taheri, Lifu Tu, Kevin Gimpel, Commonsense knowledge base completion, in: ACL (Volume 1: Long Papers), 2016, pp. 1445–1455.
[276] Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, Yejin Choi, Comet: Commonsense transformers for automatic knowledge graph construction, 2019, arXiv preprint arXiv:1906.05317.
[277] Chaitanya Malaviya, Chandra Bhagavatula, Antoine Bosselut, Yejin Choi, Commonsense knowledge base completion with structural and semantic context, in: AAAI, Vol. 34, 2020, pp. 2925–2933.
[278] Xuelu Chen, Muhao Chen, Weijia Shi, Yizhou Sun, Carlo Zaniolo, Embedding uncertain knowledge graphs, in: AAAI, Vol. 33, 2019, pp. 3363–3370.
[279] Yohan Chalier, Simon Razniewski, Gerhard Weikum, Joint reasoning for multi-faceted commonsense knowledge, 2020, arXiv preprint arXiv:2001.04170.
[280] Wentao Wu, Hongsong Li, Haixun Wang, Kenny Q. Zhu, Probase: A probabilistic taxonomy for text understanding, in: ACM SIGMOD International Conference on Management of Data, 2012, pp. 481–492.
[281] Robyn Speer, Joshua Chin, Catherine Havasi, Conceptnet 5.5: An open multilingual graph of general knowledge, in: AAAI, Vol. 31, 2017.
[282] Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bishan Yang, Justin Betteridge, Andrew Carlson, Bhavana Dalvi, Matt Gardner, Bryan Kisiel, et al., Never-ending learning, Commun. ACM 61 (5) (2018) 103–115.
[283] Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, Yejin Choi, Atomic: An atlas of machine commonsense for if-then reasoning, in: AAAI, Vol. 33, 2019, pp. 3027–3035.
[284] Robert Speer, Catherine Havasi, ConceptNet 5: A large semantic network for relational knowledge, in: The People's Web Meets NLP, Springer, 2013, pp. 161–176.
[285] Stanislaw Jastrzebski, Dzmitry Bahdanau, Seyedarian Hosseini, Michael Noukhovitch, Yoshua Bengio, Jackie Chi Kit Cheung, Commonsense mining as knowledge base completion? A study on the impact of novelty, 2018, arXiv preprint arXiv:1804.09259.
[286] Hugo Liu, Push Singh, ConceptNet—a practical commonsense reasoning tool-kit, BT Technol. J. 22 (4) (2004) 211–226.
[287] Damian Szklarczyk, John H. Morris, Helen Cook, Michael Kuhn, Stefan Wyder, Milan Simonovic, Alberto Santos, Nadezhda T. Doncheva, Alexander Roth, Peer Bork, et al., The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible, Nucleic Acids Res. (2016) gkw937.
[288] Push Singh, Thomas Lin, Erik T. Mueller, Grace Lim, Travell Perkins, Wan Li Zhu, Open mind common sense: Knowledge acquisition from the general public, in: OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", Springer, 2002, pp. 1223–1237.
[289] Gabor Angeli, Christopher D. Manning, Philosophers are mortal: Inferring the truth of unseen facts, in: CoNLL, 2013, pp. 133–142.
[290] Jonathan Gordon, Benjamin Van Durme, Reporting bias and knowledge acquisition, in: Workshop on AKBC, 2013, pp. 25–30.
[291] Jianfeng Wen, Jianxin Li, Yongyi Mao, Shini Chen, Richong Zhang, On the representation and embedding of knowledge bases beyond binary relations, in: IJCAI, 2016, pp. 1300–1307.
[292] Saiping Guan, Xiaolong Jin, Yuanzhuo Wang, Xueqi Cheng, Link prediction on n-ary relational data, in: WWW, 2019, pp. 583–593.
[293] Michalis Faloutsos, Petros Faloutsos, Christos Faloutsos, On power-law relationships of the internet topology, in: The Structure and Dynamics of Networks, Princeton University Press, 2011, pp. 195–206.
[294] Mark Steyvers, Joshua B. Tenenbaum, The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth, Cogn. Sci. 29 (1) (2005) 41–78.
[295] Prodromos Kolyvakis, Alexandros Kalousis, Dimitris Kiritsis, Hyperkg: Hyperbolic knowledge graph embeddings for knowledge base completion, 2019, arXiv preprint arXiv:1908.04895.
[296] Maximillian Nickel, Douwe Kiela, Learning continuous hierarchies in the lorentz model of hyperbolic geometry, in: ICML, PMLR, 2018, pp. 3779–3788.
[297] Octavian Ganea, Gary Bécigneul, Thomas Hofmann, Hyperbolic entailment cones for learning hierarchical embeddings, in: ICML, PMLR, 2018, pp. 1646–1655.
[298] Frederic Sala, Chris De Sa, Albert Gu, Christopher Ré, Representation tradeoffs for hyperbolic embeddings, in: ICML, PMLR, 2018, pp. 4460–4469.
[299] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, Sequence level training with recurrent neural networks, 2015, CoRR abs/1511.06732.
[300] Romain Paulus, Caiming Xiong, Richard Socher, A deep reinforced model for abstractive summarization, 2017, arXiv preprint arXiv:1705.04304.
[301] Kelvin Guu, Panupong Pasupat, Evan Zheran Liu, Percy Liang, From language to programs: Bridging reinforcement learning and maximum marginal likelihood, 2017, arXiv preprint arXiv:1704.07926.
[302] Peng Lin, Qi Song, Yinghui Wu, Fact checking in knowledge graphs with ontological subgraph patterns, Data Sci. Eng. 3 (4) (2018) 341–358.
[303] Wenhan Xiong, Thien Hoang, William Yang Wang, Deeppath: A reinforcement learning method for knowledge graph reasoning, 2017, arXiv preprint arXiv:1707.06690.
[304] Yelong Shen, Jianshu Chen, Po-Sen Huang, Yuqing Guo, Jianfeng Gao, Reinforcewalk: Learning to walk in graph with monte carlo tree search, 2018.
[305] Rich Caruana, Multitask learning, Mach. Learn. 28 (1) (1997) 41–75.