AKMiner: Domain-Specific Knowledge Graph Mining from Academic Literatures
Abstract. Existing academic search systems like Google Scholar usually return a long list of scientific articles for a given research domain or topic (e.g. “document summarization”, “information extraction”), and users need to read volumes of articles to get an overview of the research progress in a domain, which is very tedious and time-consuming. In this paper, we propose a novel system called AKMiner (Academic Knowledge Miner) to automatically mine useful knowledge from the articles in a specific domain and visually present the resulting knowledge graph to users. Our system consists of two major components: a) an extraction module, which jointly extracts academic concepts and relations based on a Markov Logic Network, and b) a visualization module, which generates knowledge graphs, including concept cloud graphs and concept relation graphs. Experimental results demonstrate the effectiveness of each component of our proposed system.
1 Introduction
Academic literature offers scientific researchers knowledge about the current progress and history of a specific research domain (e.g. “document summarization”, “information extraction”). By reading scientific articles, beginners can grasp the basic knowledge of a research domain before in-depth study, and experienced researchers can conveniently learn about recent significant progress. However, relevant academic information is often overwhelming: researchers typically face a very large number of publications of interest. Digital libraries offer various database querying tools, and Internet search companies have developed academic search engines. Typical academic search engines, such as Google Scholar, Microsoft Academic Search and CiteSeer, all achieve good retrieval performance.
However, most academic search systems simply return a list of relevant articles by matching keywords or author information. There are usually a great many articles on a specific topic, and it is hard for researchers, especially beginners, to gain a quick overview of the whole knowledge structure.
X. Lin et al. (Eds.): WISE 2013, Part II, LNCS 8181, pp. 241–255, 2013.
© Springer-Verlag Berlin Heidelberg 2013
242 S. Huang and X. Wan
Another way to follow the development of a research domain or topic is to read review papers, such as the surveys published in ACM Computing Surveys. However, there are usually only a few high-quality review papers available for most research domains. Besides, the publishing cycle of review papers is long compared to the fast pace at which research results appear. Moreover, review papers are usually very long, and the knowledge is embedded in running text, which fails to show the knowledge structure visually and vividly.
To relieve researchers of the burden of tedious paper reading and help them quickly acquire information about recent achievements and developments, we propose a novel system called AKMiner (Academic Knowledge Miner), which extracts academic concepts and relations from academic articles and generates knowledge graphs for a given research domain or topic. With these graphs, researchers can quickly get a basic view of a research domain and learn about the latest achievements directly.
Our AKMiner system consists of two phases: a) extraction of academic concepts and relations, and b) academic knowledge graph generation. In the first phase, a Markov Logic Network (MLN) is applied to build a joint model that extracts academic concepts and their relations from articles simultaneously. A concept cloud graph and a concept relation graph are then generated to visually present domain-specific concepts and relations, respectively. For concept cloud graph generation, the PageRank algorithm [19] is applied to calculate the importance scores of the concepts in a domain: the more important a concept is, the bigger it is displayed in the graph. Experiments are conducted on datasets from four domains, with training data and test data drawn from different domains. Evaluation results show that our proposed approach is more effective for knowledge extraction than several baseline methods (e.g. support vector machine (SVM), conditional random fields (CRF), C-Value). Case studies show several good characteristics of the generated knowledge graphs.
The contributions of our study are summarized as follows:
• We propose a novel AKMiner system to mine knowledge graphs for a research
domain, which can be used for enhancing existing academic search systems.
• We propose a joint model based on Markov Logic Network to extract academic
concepts and their relations.
• Experimental results on four datasets and case studies of knowledge graphs show the effectiveness of our proposed system.
2 Related Work
2.1 Information Extraction
Information extraction (IE) techniques have been widely investigated for various
purposes in the text mining and natural language processing fields. Typical IE tasks
include named entity recognition (NER), relation detection and classification, and
event detection and classification. State-of-the-art NER systems usually formulate
NER as a sequence labeling problem, and employ various discriminative structured
prediction models (e.g. hidden Markov model, maximum-entropy Markov model,
CRF) to resolve it. Relation detection and classification aims to extract relations
among entities and classify them into different categories. Many statistical techniques
have been investigated to predict entity relations such as [32]. Recently, joint models
have been proposed to extract entities and relations simultaneously [20], [21], which
achieve superior performance to pipeline models. Event detection and classification aims to mine significant events and group them into relevant topics, enabling more discussions and comparisons within and across topics.
Term extraction, or terminology extraction, is a subtask of information extraction which aims to extract multi-word expressions from a large corpus. Different methodologies for automatic term extraction have been investigated, including linguistic, statistical and hybrid approaches [5]. Linguistic approaches identify terms by using heuristic rules or patterns [8]. Statistical approaches usually rank candidate terms according to a criterion that distinguishes between true and false terms and gives higher confidence to the better ones [3], [7], [10]. Hybrid approaches usually combine linguistic and statistical approaches in a two-stage framework [7], [15]. Keyphrase extraction is a closely related task: most keyphrase extraction methods first extract candidate phrases with natural language processing techniques, and then rank the candidates and select the final keyphrases with supervised or unsupervised algorithms [14], [18], [30].
3 Overview
In this section, we define academic concepts and relations used in our system. To
describe an academic domain, we often present the main tasks in the domain, and
summarize the main methods applied to solve the tasks. Hence, in this study, we focus
on two kinds of academic concepts: Task and Method.
The Task concepts are specific problems to be solved in academic literatures, in-
cluding all concepts related to tasks, subtasks, problems and projects, like “machine
translation”, “document summarization”, “query-focused summarization”, etc.
The Method concepts are defined as ways to solve specific Tasks, including all
concepts describing algorithms, techniques, models, tools and so on, like “Markov
logic”, “CRFs”, “heuristic-based algorithm”, and “XML-based tool”.
The relations extracted by our system are of two kinds: relations between the two kinds of academic concepts (i.e. Method-Task relations), and relations within the same kind of academic concepts (i.e. Method-Method and Task-Task relations). A relation of the first kind is formed when a Method is applied to a Task (e.g., “extractive method” for “document summarization”). Relations of the latter kind, between Methods or between Tasks, are formed by dependency, evolution and enhancement (e.g., “Markov model” and “hidden Markov model”, or “single document summarization” and “multi-document summarization”).
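For concreteness, the two concept types and three relation types above can be represented with a small data model. This is only an illustrative sketch (the class and field names are ours, not part of the system's implementation):

```python
from dataclasses import dataclass

# Illustrative data model for the concept and relation types defined above.
@dataclass(frozen=True)
class Concept:
    phrase: str
    category: str  # "Task" or "Method"

@dataclass(frozen=True)
class Relation:
    source: Concept
    target: Concept

    @property
    def kind(self) -> str:
        # One of: Method-Task, Method-Method, Task-Task
        return f"{self.source.category}-{self.target.category}"

mt = Relation(Concept("extractive method", "Method"),
              Concept("document summarization", "Task"))
tt = Relation(Concept("single document summarization", "Task"),
              Concept("multi-document summarization", "Task"))
print(mt.kind)  # Method-Task
print(tt.kind)  # Task-Task
```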
Hard Rules
Hard rules describe constraints that should always hold true. These rules are assigned a much larger weight than the other rules.
Non-overlapping Rules
We make the rule that three categories of concepts (Task, Method and Other) do not
overlap. That is to say, a concept only belongs to one of the categories.
Task(c) => !Method(c) ^ !Other(c)    (1)
Method(c) => !Task(c) ^ !Other(c)    (2)
Other(c) => !Task(c) ^ !Method(c)    (3)
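For illustration, such hard constraints can be written directly in the syntax of an MLN rule file as used by systems like Alchemy, where a trailing period marks a hard rule; the predicate names here mirror our formulas and are illustrative:

```
// Categories are mutually exclusive; the trailing "." marks a hard rule.
Task(c) => !Method(c) ^ !Other(c).
Method(c) => !Task(c) ^ !Other(c).
Other(c) => !Task(c) ^ !Method(c).
```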
Soft Rules
These rules describe constraints that we expect to be usually true, but not all the time.
Neighbor-based Rules
We assume that two neighbor concepts probably have a relation.
Neighbor(c1, c2) ^ Method(c1) ^ Task(c2) => Apply(c1, c2)    (7)
Neighbor(c1, c2) ^ Task(c1) ^ Task(c2) => Task_Rel(c1, c2)    (8)
Neighbor(c1, c2) ^ Method(c1) ^ Method(c2) => Method_Rel(c1, c2)    (9)
Keyword-based Rules
We consider keywords as important clues to extract academic concepts. For example,
a concept with the word “algorithm” as suffix is probably a Method, and a concept
with “problem” as suffix is likely to be a Task. But sometimes this rule is not true. For
instance, “efficient method” is not a useful concept. So keyword-based rules are soft.
Method_Keyword(c) => Method(c)    (10)
Task_Keyword(c) => Task(c)    (11)
In addition, the words around concept phrases also offer much useful information,
such as the words “propose”, “demonstrate”, and “present”. A concept with this kind
of keywords appearing around probably belongs to the related concept category. The
rules are as follows.
Method_Keyword_Around(c) => Method(c)    (12)
Task_Keyword_Around(c) => Task(c)    (13)
Containing-based Rules
From the texts, we find that if the predicate Contain(c1, c2) is true, c1 likely contains more modifiers than c2. If c1 is a Task, c2 tends to be a Task, too. For example, Contain(“Chinese to English MT”, “MT”) ^ Task(“Chinese to English MT”) => Task(“MT”). In this case, c1 helps to determine the category of c2. But if c1 is a Method, we cannot perform this inference: for example, Contain(“phrase-based MT”, “MT”) ^ Method(“phrase-based MT”) does not entail Method(“MT”).
However, if c2 is a Method concept, c1 is probably a Method concept too. For instance, Contain(“hidden Markov model”, “Markov model”) ^ Method(“Markov model”) => Method(“hidden Markov model”). On the other hand, if c2 is a Task concept, this inference is unreasonable: for example, Contain(“phrase-based statistical machine translation”, “machine translation”) ^ Task(“machine translation”) does not entail Task(“phrase-based statistical machine translation”). In all, we have the formulas below.
Contain(c1, c2) ^ Task(c1) => Task(c2)    (14)
Contain(c1, c2) ^ Method(c2) => Method(c1)    (15)
Apposition Rules
Apposition information can also be used for extraction of concepts. Given the predi-
cate Apposition (c1, c2), we add a rule that if two concepts are appositive, they likely
belong to the same category. The rule is built below:
Apposition(c1, c2) => (Task(c1) <=> Task(c2)) ^ (Method(c1) <=> Method(c2))    (16)
Actually, keyword-based rules are the basic rules for recognizing concepts, while containing-based and apposition rules help to detect some missing concepts (e.g. “conditional random fields”) and correct some wrong judgments, making the concept extraction more complete.
Transitivity Rules
Some useful concept pairs may fail to be extracted, so we apply transitivity in relation inference. We suppose that if c1 and c2 have a relation, and c2 and c3 have a relation, then c1 and c3 potentially have a relation. The precondition is that the concepts are in the same category; relations between Task and Method are not assumed to be transitive. The rules are represented below:
Task_Rel(c1, c2) ^ Task_Rel(c2, c3) => Task_Rel(c1, c3)    (17)
Method_Rel(c1, c2) ^ Method_Rel(c2, c3) => Method_Rel(c1, c3)    (18)
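The effect of the transitivity rules (17)-(18) can be sketched as a transitive closure computed within a single category. This is a plain-Python illustration of the intended inference, not the actual MLN procedure:

```python
# Relations propagate within one category (Task-Task or Method-Method),
# never across categories, so the closure is computed per relation set.
def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are both present."""
    rel = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(rel):
            for b2, c in list(rel):
                if b == b2 and a != c and (a, c) not in rel:
                    rel.add((a, c))
                    changed = True
    return rel

task_rel = {("document summarization", "query-focused summarization"),
            ("query-focused summarization", "update summarization")}
closed = transitive_closure(task_rel)
print(("document summarization", "update summarization") in closed)  # True
```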
6 Empirical Evaluation
1 The size of the literature set in a domain can range from several to thousands.
2 http://www.acl.org/ The datasets used will be published on our website.
3 http://nlp.stanford.edu/
The initial NP set contains many useless phrases, such as “we”, “this paper”, “future work”, etc. We therefore apply a simple filter based on several linguistic rules, and also use a stop-word list to discard useless NPs or useless words within NPs, such as “some”, “many”, “efficient”, “general”, etc. In addition, NPs that are too long or too short are excluded.
Baselines
To verify the performance of our proposed joint model, we develop three baseline
methods (CRF model, SVM classifier, and C-Value method) for concept extraction
and a baseline method (SVM classifier) for relation extraction.
The CRF model [16] can be used to extract academic concepts from articles. To build training data for the CRF model, we search for the labeled concepts in the articles and mark their occurrences. If a concept occurs in an article, each word covered by the concept phrase is labeled with the relevant tag (T or M); all other words are marked with the irrelevant tag (O). When two concept phrases overlap, such as “machine translation” and “statistical machine translation”, we consider the longer one as the complete phrase. The features used in the CRF model include the current word, the words around it, part-of-speech tags and keyword-based features. The keyword lists are the same as those used for the MLN. Besides, to guarantee a fair comparison, NP features are also used in the CRF, including a word’s position in an NP (i.e. outside, beginning, middle or end of an NP).
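The labeling scheme described above, with the longest phrase winning on overlap, can be sketched like this (the helper and its inputs are illustrative):

```python
# Label each token covered by a known concept phrase with its tag (T or M);
# all remaining tokens get O. Longer phrases are matched first, so when
# concepts overlap the longer one is kept as the complete phrase.
def label_tokens(tokens, concepts):
    """concepts: list of (phrase, tag) pairs."""
    tags = ["O"] * len(tokens)
    for phrase, tag in sorted(concepts, key=lambda c: -len(c[0].split())):
        p = phrase.split()
        for i in range(len(tokens) - len(p) + 1):
            if tokens[i:i+len(p)] == p and all(t == "O" for t in tags[i:i+len(p)]):
                tags[i:i+len(p)] = [tag] * len(p)
    return tags

tokens = "we study statistical machine translation".split()
concepts = [("machine translation", "T"),
            ("statistical machine translation", "T")]
print(label_tokens(tokens, concepts))  # ['O', 'O', 'T', 'T', 'T']
```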
The Support Vector Machine [4] is another popular method for information extraction; here it classifies NPs into the three categories. The features used for the SVM include prefix and suffix information of the NP and keyword-based features. The keyword-based features cover keywords inside and around concepts, as used in the MLN.
C-Value [10] is an unsupervised algorithm for multi-word term recognition; here the terms correspond to concept phrases. It extracts phrases according to their frequency of occurrence, and enhances this common statistical measure with linguistic information and statistical features of the candidate NPs.
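The core of the C-Value measure of Frantzi et al. [10] can be sketched as follows: a candidate's frequency, scaled by its length, is discounted by the average frequency of the longer candidates that contain it. This is a simplification; the full method adds linguistic filtering and the NC-value extension.

```python
import math

def c_value(a, freq):
    """Simplified C-Value. freq: dict mapping candidate phrase -> frequency."""
    length = len(a.split())
    # Longer candidates in which `a` appears nested (substring test is a
    # simplification of proper word-sequence nesting).
    nests = [b for b in freq if a in b and b != a]
    if not nests:
        return math.log2(length) * freq[a]
    nested_freq = sum(freq[b] for b in nests) / len(nests)
    return math.log2(length) * (freq[a] - nested_freq)

freq = {"machine translation": 10,
        "statistical machine translation": 4,
        "phrase-based machine translation": 2}
print(c_value("machine translation", freq))  # 7.0
```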
For concept relation extraction, we use an SVM as the baseline. Classification is conducted on candidate concept pairs, which are combinations of two adjacent concepts; each candidate pair is classified into one of two categories: having a relation or not. The features for classification include the phrases’ lengths and position information, relation-related keywords, and the keywords’ position information. Phrase lengths are calculated by ignoring brackets and the content inside them. The position information includes positions within a sentence, the order of the phrases in a pair, and the distance between them. Relation-related keywords are collected manually, including phrases like “based on”, “enhancement”, “developed on”, etc.
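The feature extraction for a candidate pair can be sketched as below; the keyword list and feature names are illustrative, not the system's exact ones:

```python
import re

# A few manually collected relation keywords (illustrative subset).
RELATION_KEYWORDS = ["based on", "enhancement", "developed on"]

def strip_brackets(phrase):
    # Lengths are computed ignoring brackets and the content inside them.
    return re.sub(r"\([^)]*\)", "", phrase).strip()

def pair_features(c1, c2, sentence):
    p1, p2 = strip_brackets(c1), strip_brackets(c2)
    i, j = sentence.find(c1), sentence.find(c2)
    between = sentence[i + len(c1):j] if i < j else sentence[j + len(c2):i]
    return {
        "len1": len(p1.split()),
        "len2": len(p2.split()),
        "c1_first": i < j,              # order of the phrases in the pair
        "distance": abs(j - i),         # distance between the phrases
        "keyword": any(k in between for k in RELATION_KEYWORDS),
    }

s = "extractive methods based on graph ranking for document summarization"
f = pair_features("extractive methods", "document summarization", s)
print(f["keyword"], f["c1_first"])  # True True
```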
Alchemy
We utilize the Alchemy system 4 in our experiments. Alchemy is an open-source
package providing a series of algorithms for statistical relational learning and proba-
bilistic logic inference, based on the Markov logic representation.
4 http://alchemy.cs.washington.edu/
We can also find that the extraction of Methods is generally more effective than the extraction of Tasks. Academic articles usually discuss several methods for one task, so the number of Methods is naturally larger than that of Tasks, as confirmed by the numbers of extracted concepts in Table 1. With more ground atoms of Method in the training data, the MLN can learn a better model and performs more effectively for Method extraction.
Finally, we can conclude that our joint model helps to extract concepts in a more
comprehensive way by taking into account global information and the relations be-
tween concepts. This advantage will be more notable in relation extraction.
importance of other concepts that have relations with this concept. We have two
assumptions:
• The more frequently a concept appears, the more important it is.
• The more important the other concepts related to the current concept are, the more
important the current concept is.
In addition, the position where a concept occurs is also taken into account in the importance assessment. For instance, a concept occurring in a title gets a larger importance weight than one occurring in other sections: the position-based weight of a concept is multiplied by its frequency in the corresponding section. In this study, each concept’s prior importance score is given by its frequency, and the frequency of a concept occurring in a title is multiplied by a weight (set to 1.2 in our experiments).
In the experiments, the personalized PageRank algorithm [12] is applied for con-
cept importance assessment. The personalized PageRank algorithm can take into ac-
count both the prior scores and the concept relations, and the importance score of a
concept is iteratively calculated until convergence.
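The personalized PageRank computation over the concept relation graph can be sketched as follows. The graph, priors and parameter values are illustrative; the damping split mirrors the standard formulation, with the prior (normalized frequency) scores taking the place of the uniform teleport distribution:

```python
# Personalized PageRank: each node's score mixes its normalized prior
# with rank received from its neighbors, iterated to (near) convergence.
def personalized_pagerank(graph, prior, d=0.85, iters=100):
    """graph: dict node -> list of neighbors; prior: dict node -> raw score."""
    total = sum(prior.values())
    p = {n: v / total for n, v in prior.items()}  # normalized prior scores
    score = dict(p)
    for _ in range(iters):
        new = {}
        for n in graph:
            rank = sum(score[m] / len(graph[m]) for m in graph if n in graph[m])
            new[n] = (1 - d) * p[n] + d * rank
        score = new
    return score

graph = {"CRF": ["sequence labeling"],
         "HMM": ["sequence labeling"],
         "sequence labeling": ["CRF", "HMM"]}
prior = {"CRF": 5, "HMM": 3, "sequence labeling": 8}
scores = personalized_pagerank(graph, prior)
print(max(scores, key=scores.get))  # sequence labeling
```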
5 http://www.wordle.net/advanceds
References
1. Abu-Jbara, A., Radev, D.: Coherent Citation-Based Summarization of Scientific Papers.
In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguis-
tics, pp. 500–509 (2011)
2. Agarwal, N., Gvr, K.: SciSumm: A Multi-Document Summarization System for Scientific
Articles. In: Proceedings of the ACL-HLT 2011 System Demonstrations, pp. 115–120
(2011)
3. Church, K.W., Hanks, P.: Word association norms, mutual information and lexicography.
In: ACL 1989, pp. 76–83 (1989)
4. Cortes, C., Vapnik, V.: Support-vector Networks. Machine Learning 20(3), 273–297
(1995)
5. Daille, B.: Combined approach for terminology extraction: lexical statistics and linguistic
filtering. Technical Report (1995)
6. Dunne, C., Shneiderman, B., Gove, R., Klavans, J., Dorr, B.: Rapid Understanding of
Scientific Paper Collections: Integrating Statistics, Text Analytics, and Visualization. Uni-
versity of Maryland, Human-Computer Interaction Lab. Tech. Report HCIL-2011 (2011)
7. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61–74 (1993)
8. Earl, L.L.: Experiments in automatic extracting and indexing. Information Storage and Re-
trieval 6(X), 273–288 (1970)
9. EL-Arini, K., Guestrin, C.: Beyond Keyword Search: Discovering Relevant Scientific Li-
terature. In: Proceedings of the 17th SIGKDD (2011)
10. Frantzi, K., Ananiadou, S., Mima, H.: Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries 3, 115–130 (2000)
11. Han, H., Giles, C., Manavoglu, E., Zha, H., Zhang, Z., Fox, E.: Automatic Document Me-
ta-data Extraction using Support Vector Machines. In: Proceedings of Joint Conference on
Digital Libraries (2003)
12. Haveliwala, T.H.: Topic-sensitive pagerank. In: WWW 2002, pp. 517–526. ACM, New
York (2002)
13. Councill, I.G., Giles, C.L., Kan, M.Y.: ParsCit: An open-source CRF reference string parsing package. In: Proceedings of the Language Resources and Evaluation Conference (LREC 2008), Marrakech, Morocco (May 2008)
14. Jiang, X., Hu, Y., Li, H.: A ranking approach to keyphrase extraction. Microsoft Research Technical Report (2009)
15. Justeson, J., Katz, S.: Technical terminology: some linguistic properties and an algorithm
for identification in text. Natural Language Engineering 1, 9–27 (1995)
16. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In: Proceedings of ICML, pp. 282–289 (2001)
17. Li, N., Zhu, L., Mitra, P., Mueller, K., Poweleit, E.: oreChem ChemXSeer: a semantic dig-
ital library for chemistry. In: JCDL (2010)
18. Mihalcea, R., Tarau, P.: TextRank: Bringing order into texts. In: Proceedings of EMNLP
2004 (2004)
19. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing
order to the web. Technical report, Stanford Digital Libraries (1998)
20. Poon, H., Domingos, P.: Joint inference in information extraction. In: Proceedings of
AAAI 2007 (2007)
21. Poon, H., Vanderwende, L.: Joint inference for knowledge extraction from biomedical lite-
rature. In: Proceedings of the 2010 Annual Conference of the North American Chapter of
the Association for Computational Linguistics (2010)
22. Poon, H., Domingos, P.: Joint unsupervised coreference resolution with Markov logic. In:
Proceedings of the Conference on Empirical Methods in Natural Language Processing
(2008)
23. Qazvinian, V., Radev, D.R.: Scientific Paper Summarization Using Citation Summary
Networks. In: Proceedings of COLING 2008, vol. 1, pp. 689–696 (2008)
24. Richardson, M., Domingos, P.: Markov Logic Networks. Machine Learning 62(1-2), 107–136 (2006)
25. Shahaf, D., Guestrin, C., Horvitz, E.: Metro Maps of Science. In: Proceedings of the 18th
ACM SIGKDD (2012)
26. Singla, P., Domingos, P.: Entity resolution with markov logic. In: Proceedings of ICDM
2006 (2006)
27. Singla, P., Kautz, H., Luo, J.: Discovery of social relationships in consumer photo collec-
tions using Markov Logic. In: Workshops of CVPRW 2008 (2008)
28. Sugiyama, K., Kan, M.Y.: Scholarly paper recommendation via user’s recent research interests. In: JCDL (2010)
29. Kondo, T., Nanba, H., Takezawa, T., Okumura, M.: Technical Trend Analysis by
Analyzing Research Papers’ Titles. In: Vetulani, Z. (ed.) LTC 2009. LNCS, vol. 6562,
pp. 512–521. Springer, Heidelberg (2011)
30. Wan, X.J., Xiao, J.G.: Single document keyphrase extraction using neighborhood know-
ledge. In: Proceedings of AAAI 2008 (2008)
31. Yeloglu, O., Milios, E., Zincir-Heywood, N.: Multi-document Summarization of Scientific
Corpora. In: SAC (2011)
32. Zhu, J., Nie, Z., Liu, X., Zhang, B.: StatSnowball: a statistical approach to extracting entity
relationships. In: Proceedings of 18th WWW Conference (2009)