AKMiner: Domain-Specific Knowledge Graph Mining from Academic Literatures
Abstract. Existing academic search systems like Google Scholar usually return a long list of scientific articles for a given research domain or topic (e.g. “document summarization”, “information extraction”), and users need to read volumes of articles to get an overview of the research progress in a domain, which is very tedious and time-consuming. In this paper, we propose a novel system called AKMiner (Academic Knowledge Miner) to automatically mine useful knowledge from the articles in a specific domain and visually present the resulting knowledge graph to users. Our system consists of two major components: a) an extraction module, which jointly extracts academic concepts and relations based on a Markov Logic Network, and b) a visualization module, which generates knowledge graphs, including concept cloud graphs and concept relation graphs. Experimental results demonstrate the effectiveness of each component of our proposed system.
1 Introduction
Academic literature offers scientific researchers knowledge about the current progress and history of a specific research domain (e.g. “document summarization”, “information extraction”). By reading scientific articles, beginners can grasp the basic knowledge of a research domain before in-depth study, and experienced researchers can conveniently learn about recent significant progress. However, relevant academic information is often overwhelming: researchers typically face a very large number of publications of interest. Digital libraries offer various database querying tools, and Internet search companies have developed academic search engines. Typical academic search engines, such as Google Scholar, Microsoft Academic Search and CiteSeer, all achieve good retrieval performance.
However, most academic search systems simply return a list of relevant articles by matching keywords or author information. There are usually a great many articles on a specific topic, and it is hard for researchers, especially beginners, to gain a quick overview of the whole knowledge structure.
X. Lin et al. (Eds.): WISE 2013, Part II, LNCS 8181, pp. 241–255, 2013.
© Springer-Verlag Berlin Heidelberg 2013
242 S. Huang and X. Wan
Another way to follow the development of a research domain or topic is to read review papers, such as the surveys published in ACM Computing Surveys. However, there are usually only a few high-quality review papers available for most research domains. Besides, the publishing cycle of review papers is long compared to the fast pace at which research results appear. Moreover, review papers are usually very long, and the knowledge is embedded in running text, which fails to show the knowledge structure visually and vividly.
To relieve researchers of the burden of tedious paper reading and help them quickly acquire information about recent achievements and developments, we propose a novel system called AKMiner (Academic Knowledge Miner), which extracts academic concepts and relations from academic articles and generates knowledge graphs for a given research domain or topic. With these graphs, researchers can quickly get a basic view of a research domain and learn about the latest achievements directly.
Our AKMiner system consists of two phases: a) extraction of academic concepts and relations, and b) academic knowledge graph generation. In the first phase, a Markov Logic Network (MLN) is applied to build a joint model that extracts academic concepts and their relations from articles simultaneously. A concept cloud graph and a concept relation graph are then generated to visually present domain-specific concepts and relations, respectively. For concept cloud graph generation, the PageRank algorithm [19] is applied to calculate the importance scores of the concepts in a domain: the more important a concept is, the bigger it is displayed in the graph. Experiments are conducted on datasets from four domains, with training data and test data drawn from different domains. Evaluation results show that our proposed approach is more effective for knowledge extraction than several baseline methods (e.g. support vector machine (SVM), conditional random fields (CRF), C-Value). Case studies show several good characteristics of the generated knowledge graphs.
The contributions of our study are summarized as follows:
• We propose a novel AKMiner system to mine knowledge graphs for a research
domain, which can be used for enhancing existing academic search systems.
• We propose a joint model based on Markov Logic Network to extract academic
concepts and their relations.
• Experimental results on four datasets and case studies of knowledge graphs show the effectiveness of our proposed system.
2 Related Work
2.1 Information Extraction
Information extraction (IE) techniques have been widely investigated for various
purposes in the text mining and natural language processing fields. Typical IE tasks
include named entity recognition (NER), relation detection and classification, and
event detection and classification. State-of-the-art NER systems usually formulate
NER as a sequence labeling problem, and employ various discriminative structured
prediction models (e.g. hidden Markov model, maximum-entropy Markov model,
CRF) to resolve it. Relation detection and classification aims to extract relations
among entities and classify them into different categories. Many statistical techniques
have been investigated to predict entity relations such as [32]. Recently, joint models
have been proposed to extract entities and relations simultaneously [20], [21], which
achieve superior performance to pipeline models. Event detection and classification aims to mine significant events and group them into relevant topics, enabling more discussions and comparisons within and across topics.
Term extraction, or terminology extraction, is a subtask of information extraction which aims to extract multi-word expressions from a large corpus. Different methodologies for automatic term extraction have been investigated, including linguistic, statistical and hybrid approaches [5]. Linguistic approaches identify terms by using heuristic rules or patterns [8]. Statistical approaches usually rank candidate terms according to a criterion that distinguishes between true and false terms and gives higher confidence to the better ones [3], [7], [10]. Hybrid approaches usually combine linguistic and statistical approaches in a two-stage framework [7], [15]. Keyphrase extraction is a closely related task: most keyphrase extraction methods first extract candidate phrases with natural language processing techniques, and then rank the candidates and select the final keyphrases with supervised or unsupervised algorithms [14], [18], [30].
3 Overview
In this section, we define academic concepts and relations used in our system. To
describe an academic domain, we often present the main tasks in the domain, and
summarize the main methods applied to solve the tasks. Hence, in this study, we focus
on two kinds of academic concepts: Task and Method.
The Task concepts are specific problems to be solved in academic literatures, in-
cluding all concepts related to tasks, subtasks, problems and projects, like “machine
translation”, “document summarization”, “query-focused summarization”, etc.
The Method concepts are defined as ways to solve specific Tasks, including all
concepts describing algorithms, techniques, models, tools and so on, like “Markov
logic”, “CRFs”, “heuristic-based algorithm”, and “XML-based tool”.
The relations extracted by our system are of two kinds: relations between the two kinds of academic concepts (i.e. Method-Task relations), and relations within the same kind of academic concepts (i.e. Method-Method and Task-Task relations). A relation of the first kind is formed when a Method is applied to a Task (e.g., “extractive method” for “document summarization”). Relations of the latter kind, between Methods or between Tasks, are formed by dependency, evolution and enhancement (e.g., “Markov model” and “hidden Markov model”, or “single document summarization” and “multi-document summarization”).
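For concreteness, the two concept types and three relation types above can be represented with a small data model. This is only an illustrative sketch (the class and field names are ours, not part of the system's implementation):

```python
from dataclasses import dataclass

# Illustrative data model for the concept and relation types defined above.
@dataclass(frozen=True)
class Concept:
    phrase: str
    category: str  # "Task" or "Method"

@dataclass(frozen=True)
class Relation:
    source: Concept
    target: Concept

    @property
    def kind(self) -> str:
        # One of: Method-Task, Method-Method, Task-Task
        return f"{self.source.category}-{self.target.category}"

mt = Relation(Concept("extractive method", "Method"),
              Concept("document summarization", "Task"))
tt = Relation(Concept("single document summarization", "Task"),
              Concept("multi-document summarization", "Task"))
print(mt.kind)  # Method-Task
print(tt.kind)  # Task-Task
```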
Hard Rules
Hard rules describe constraints that should always hold true. These rules are assigned a much larger weight than the other rules.
Non-overlapping Rules
We make the rule that three categories of concepts (Task, Method and Other) do not
overlap. That is to say, a concept only belongs to one of the categories.
Task(c) => !Method(c) ^ !Other(c)    (1)
Method(c) => !Task(c) ^ !Other(c)    (2)
Other(c) => !Task(c) ^ !Method(c)    (3)
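For illustration, such hard constraints can be written directly in the syntax of an MLN rule file as used by systems like Alchemy, where a trailing period marks a hard rule; the predicate names here mirror our formulas and are illustrative:

```
// Categories are mutually exclusive; the trailing "." marks a hard rule.
Task(c) => !Method(c) ^ !Other(c).
Method(c) => !Task(c) ^ !Other(c).
Other(c) => !Task(c) ^ !Method(c).
```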
Soft Rules
These rules describe constraints that we expect to be usually true, but not all the time.
Neighbor-based Rules
We assume that two neighbor concepts probably have a relation.
Neighbor(c1, c2) ^ Method(c1) ^ Task(c2) => Apply(c1, c2)    (7)
Neighbor(c1, c2) ^ Task(c1) ^ Task(c2) => Task_Rel(c1, c2)    (8)
Neighbor(c1, c2) ^ Method(c1) ^ Method(c2) => Method_Rel(c1, c2)    (9)
Keyword-based Rules
We consider keywords as important clues to extract academic concepts. For example,
a concept with the word “algorithm” as suffix is probably a Method, and a concept
with “problem” as suffix is likely to be a Task. But sometimes this rule is not true. For
instance, “efficient method” is not a useful concept. So keyword-based rules are soft.
Method_Keyword(c) => Method(c)    (10)
Task_Keyword(c) => Task(c)    (11)
In addition, the words around concept phrases also offer much useful information,
such as the words “propose”, “demonstrate”, and “present”. A concept with this kind
of keywords appearing around probably belongs to the related concept category. The
rules are as follows.
Method_Keyword_Around(c) => Method(c)    (12)
Task_Keyword_Around(c) => Task(c)    (13)
Containing-based Rules
From the texts, we find that if the predicate Contain(c1, c2) is true, c1 likely contains more modifiers than c2. If c1 is a Task, c2 tends to be a Task, too. For example, Contain(“Chinese to English MT”, “MT”) ^ Task(“Chinese to English MT”) => Task(“MT”). In this case, c1 helps to determine the category of c2. But if c1 is a Method, we cannot perform this inference: for example, Contain(“phrase-based MT”, “MT”) ^ Method(“phrase-based MT”) does not entail Method(“MT”).
However, if c2 is a Method concept, c1 is probably a Method concept too. For instance, Contain(“hidden Markov model”, “Markov model”) ^ Method(“Markov model”) => Method(“hidden Markov model”). On the other hand, if c2 is a Task concept, this inference is unreasonable: for example, Contain(“phrase-based statistical machine translation”, “machine translation”) ^ Task(“machine translation”) does not entail Task(“phrase-based statistical machine translation”). In all, we have the formulas below.
Contain(c1, c2) ^ Task(c1) => Task(c2)    (14)
Contain(c1, c2) ^ Method(c2) => Method(c1)    (15)
Apposition Rules
Apposition information can also be used for extraction of concepts. Given the predi-
cate Apposition (c1, c2), we add a rule that if two concepts are appositive, they likely
belong to the same category. The rule is built below:
Apposition(c1, c2) => (Task(c1) <=> Task(c2)) ^ (Method(c1) <=> Method(c2))    (16)
Actually, keyword-based rules are the basic rules for recognizing concepts, while containing-based and apposition rules help to detect some missing concepts (e.g. “conditional random fields”) and correct some wrong judgments, making the concept extraction more complete.
Transitivity Rules
Some useful concept pairs may fail to be extracted, so we apply transitivity in relation inference. We suppose that if c1 and c2 have a relation, and c2 and c3 have a relation, then c1 and c3 potentially have a relation. The precondition is that the concepts are in the same category; relations between Task and Method are not assumed to be transitive. The rules are represented below:
Task_Rel(c1, c2) ^ Task_Rel(c2, c3) => Task_Rel(c1, c3)    (17)
Method_Rel(c1, c2) ^ Method_Rel(c2, c3) => Method_Rel(c1, c3)    (18)
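The effect of the transitivity rules (17)-(18) can be sketched as a transitive closure computed within a single category. This is a plain-Python illustration of the intended inference, not the actual MLN procedure:

```python
# Relations propagate within one category (Task-Task or Method-Method),
# never across categories, so the closure is computed per relation set.
def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are both present."""
    rel = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(rel):
            for b2, c in list(rel):
                if b == b2 and a != c and (a, c) not in rel:
                    rel.add((a, c))
                    changed = True
    return rel

task_rel = {("document summarization", "query-focused summarization"),
            ("query-focused summarization", "update summarization")}
closed = transitive_closure(task_rel)
print(("document summarization", "update summarization") in closed)  # True
```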
6 Empirical Evaluation
1 The size of the literature set in a domain can range from several to thousands.
2 http://www.acl.org/ The datasets used will be published on our website.
3 http://nlp.stanford.edu/
The initial NP set contains many useless phrases, such as “we”, “this paper”, “future work”, etc. We therefore apply a simple filter based on several linguistic rules, and also use a stop-word list to discard useless NPs or useless words within NPs, such as “some”, “many”, “efficient”, “general”, etc. In addition, NPs that are too long or too short are excluded.
Baselines
To verify the performance of our proposed joint model, we develop three baseline
methods (CRF model, SVM classifier, and C-Value method) for concept extraction
and a baseline method (SVM classifier) for relation extraction.
The CRF model [16] can be used to extract academic concepts from articles. To build training data for the CRF model, we search for the labeled concepts in the articles and mark their occurrences. If a concept occurs in an article, each word covered by the concept phrase is labeled with the relevant tag (T or M); all other words are marked with the irrelevant tag (O). When two concept phrases overlap, such as “machine translation” and “statistical machine translation”, we consider the longer one as the complete phrase. The features used in the CRF model include the current word, the words around it, part-of-speech tags and keyword-based features. The keyword lists are the same as those used for the MLN. Besides, to guarantee a fair comparison, NP features are also used in the CRF, including a word’s position in an NP (i.e. outside, beginning, middle or end of an NP).
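The labeling scheme described above, with the longest phrase winning on overlap, can be sketched like this (the helper and its inputs are illustrative):

```python
# Label each token covered by a known concept phrase with its tag (T or M);
# all remaining tokens get O. Longer phrases are matched first, so when
# concepts overlap the longer one is kept as the complete phrase.
def label_tokens(tokens, concepts):
    """concepts: list of (phrase, tag) pairs."""
    tags = ["O"] * len(tokens)
    for phrase, tag in sorted(concepts, key=lambda c: -len(c[0].split())):
        p = phrase.split()
        for i in range(len(tokens) - len(p) + 1):
            if tokens[i:i+len(p)] == p and all(t == "O" for t in tags[i:i+len(p)]):
                tags[i:i+len(p)] = [tag] * len(p)
    return tags

tokens = "we study statistical machine translation".split()
concepts = [("machine translation", "T"),
            ("statistical machine translation", "T")]
print(label_tokens(tokens, concepts))  # ['O', 'O', 'T', 'T', 'T']
```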
The Support Vector Machine [4] is another popular method for information extraction; here it classifies NPs into the three categories. The features used for the SVM include prefix and suffix information of the NP and keyword-based features. The keyword-based features cover keywords inside and around concepts, as used in the MLN.
C-Value [10] is an unsupervised algorithm for multi-word term recognition; here the terms correspond to concept phrases. It extracts phrases according to their frequency of occurrence, and enhances this common statistical measure with linguistic information and statistical features of the candidate NPs.
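The core of the C-Value measure of Frantzi et al. [10] can be sketched as follows: a candidate's frequency, scaled by its length, is discounted by the average frequency of the longer candidates that contain it. This is a simplification; the full method adds linguistic filtering and the NC-value extension.

```python
import math

def c_value(a, freq):
    """Simplified C-Value. freq: dict mapping candidate phrase -> frequency."""
    length = len(a.split())
    # Longer candidates in which `a` appears nested (substring test is a
    # simplification of proper word-sequence nesting).
    nests = [b for b in freq if a in b and b != a]
    if not nests:
        return math.log2(length) * freq[a]
    nested_freq = sum(freq[b] for b in nests) / len(nests)
    return math.log2(length) * (freq[a] - nested_freq)

freq = {"machine translation": 10,
        "statistical machine translation": 4,
        "phrase-based machine translation": 2}
print(c_value("machine translation", freq))  # 7.0
```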
For concept relation extraction, we use an SVM as the baseline. Classification is conducted on candidate concept pairs, which are combinations of two adjacent concepts; each candidate pair is classified into one of two categories: having a relation or not. The features for classification include the phrases’ lengths and position information, relation-related keywords, and the keywords’ position information. Phrase lengths are calculated by ignoring brackets and the content inside them. The position information includes positions within a sentence, the order of the phrases in a pair, and the distance between them. Relation-related keywords are collected manually, including phrases like “based on”, “enhancement”, “developed on”, etc.
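The feature extraction for a candidate pair can be sketched as below; the keyword list and feature names are illustrative, not the system's exact ones:

```python
import re

# A few manually collected relation keywords (illustrative subset).
RELATION_KEYWORDS = ["based on", "enhancement", "developed on"]

def strip_brackets(phrase):
    # Lengths are computed ignoring brackets and the content inside them.
    return re.sub(r"\([^)]*\)", "", phrase).strip()

def pair_features(c1, c2, sentence):
    p1, p2 = strip_brackets(c1), strip_brackets(c2)
    i, j = sentence.find(c1), sentence.find(c2)
    between = sentence[i + len(c1):j] if i < j else sentence[j + len(c2):i]
    return {
        "len1": len(p1.split()),
        "len2": len(p2.split()),
        "c1_first": i < j,              # order of the phrases in the pair
        "distance": abs(j - i),         # distance between the phrases
        "keyword": any(k in between for k in RELATION_KEYWORDS),
    }

s = "extractive methods based on graph ranking for document summarization"
f = pair_features("extractive methods", "document summarization", s)
print(f["keyword"], f["c1_first"])  # True True
```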
Alchemy
We utilize the Alchemy system 4 in our experiments. Alchemy is an open-source
package providing a series of algorithms for statistical relational learning and proba-
bilistic logic inference, based on the Markov logic representation.
4 http://alchemy.cs.washington.edu/
We can also find that the extraction of Methods is generally more effective than the extraction of Tasks. Academic articles usually discuss several methods for one task, so the number of Methods is naturally larger than that of Tasks, as confirmed by the numbers of extracted concepts in Table 1. With more ground atoms of Method in the training data, the MLN can learn a better model and performs more effectively for Method extraction.
Finally, we can conclude that our joint model helps to extract concepts in a more
comprehensive way by taking into account global information and the relations be-
tween concepts. This advantage will be more notable in relation extraction.
importance of other concepts that have relations with this concept. We have two
assumptions:
• The more frequently a concept appears, the more important it is.
• The more important the other concepts related to the current concept are, the more
important the current concept is.
In addition, the position where a concept occurs is also taken into account in the importance assessment. For instance, a concept occurring in a title gets a larger importance weight than one occurring in other sections: the position-based weight of a concept is multiplied by its frequency in the corresponding section. In this study, each concept’s prior importance score is given by its frequency, and the frequency of a concept occurring in a title is multiplied by a weight (set to 1.2 in our experiments).
In the experiments, the personalized PageRank algorithm [12] is applied for con-
cept importance assessment. The personalized PageRank algorithm can take into ac-
count both the prior scores and the concept relations, and the importance score of a
concept is iteratively calculated until convergence.
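The personalized PageRank computation over the concept relation graph can be sketched as follows. The graph, priors and parameter values are illustrative; the damping split mirrors the standard formulation, with the prior (normalized frequency) scores taking the place of the uniform teleport distribution:

```python
# Personalized PageRank: each node's score mixes its normalized prior
# with rank received from its neighbors, iterated to (near) convergence.
def personalized_pagerank(graph, prior, d=0.85, iters=100):
    """graph: dict node -> list of neighbors; prior: dict node -> raw score."""
    total = sum(prior.values())
    p = {n: v / total for n, v in prior.items()}  # normalized prior scores
    score = dict(p)
    for _ in range(iters):
        new = {}
        for n in graph:
            rank = sum(score[m] / len(graph[m]) for m in graph if n in graph[m])
            new[n] = (1 - d) * p[n] + d * rank
        score = new
    return score

graph = {"CRF": ["sequence labeling"],
         "HMM": ["sequence labeling"],
         "sequence labeling": ["CRF", "HMM"]}
prior = {"CRF": 5, "HMM": 3, "sequence labeling": 8}
scores = personalized_pagerank(graph, prior)
print(max(scores, key=scores.get))  # sequence labeling
```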
5 http://www.wordle.net/advanceds
References
1. Abu-Jbara, A., Radev, D.: Coherent Citation-Based Summarization of Scientific Papers.
In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguis-
tics, pp. 500–509 (2011)
2. Agarwal, N., Gvr, K.: SciSumm: A Multi-Document Summarization System for Scientific
Articles. In: Proceedings of the ACL-HLT 2011 System Demonstrations, pp. 115–120
(2011)
3. Church, K.W., Hanks, P.: Word association norms, mutual information and lexicography.
In: ACL 1989, pp. 76–83 (1989)
4. Cortes, C., Vapnik, V.: Support-vector Networks. Machine Learning 20(3), 273–297
(1995)
5. Daille, B.: Combined approach for terminology extraction: lexical statistics and linguistic
filtering. Technical Report (1995)
6. Dunne, C., Shneiderman, B., Gove, R., Klavans, J., Dorr, B.: Rapid Understanding of
Scientific Paper Collections: Integrating Statistics, Text Analytics, and Visualization. Uni-
versity of Maryland, Human-Computer Interaction Lab. Tech. Report HCIL-2011 (2011)
7. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), 61–74 (1993)
8. Earl, L.L.: Experiments in automatic extracting and indexing. Information Storage and Re-
trieval 6(X), 273–288 (1970)
9. EL-Arini, K., Guestrin, C.: Beyond Keyword Search: Discovering Relevant Scientific Li-
terature. In: Proceedings of the 17th SIGKDD (2011)
10. Frantzi, K., Ananiadou, S., Mima, H.: Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries 3, 115–130 (2000)
11. Han, H., Giles, C., Manavoglu, E., Zha, H., Zhang, Z., Fox, E.: Automatic Document Me-
ta-data Extraction using Support Vector Machines. In: Proceedings of Joint Conference on
Digital Libraries (2003)
12. Haveliwala, T.H.: Topic-sensitive pagerank. In: WWW 2002, pp. 517–526. ACM, New
York (2002)
13. Councill, I.G., Giles, C.L., Kan, M.Y.: ParsCit: An open-source CRF reference string parsing package. In: Proceedings of the Language Resources and Evaluation Conference (LREC 2008), Marrakech, Morocco (May 2008)
14. Jiang, X., Hu, Y., Li, H.: A ranking approach to keyphrase extraction. Microsoft Research Technical Report (2009)
15. Justeson, J., Katz, S.: Technical terminology: some linguistic properties and an algorithm
for identification in text. Natural Language Engineering 1, 9–27 (1995)
16. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In: Proceedings of ICML, pp. 282–289 (2001)
17. Li, N., Zhu, L., Mitra, P., Mueller, K., Poweleit, E.: oreChem ChemXSeer: a semantic dig-
ital library for chemistry. In: JCDL (2010)
18. Mihalcea, R., Tarau, P.: TextRank: Bringing order into texts. In: Proceedings of EMNLP
2004 (2004)
19. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing
order to the web. Technical report, Stanford Digital Libraries (1998)
20. Poon, H., Domingos, P.: Joint inference in information extraction. In: Proceedings of
AAAI 2007 (2007)
21. Poon, H., Vanderwende, L.: Joint inference for knowledge extraction from biomedical lite-
rature. In: Proceedings of the 2010 Annual Conference of the North American Chapter of
the Association for Computational Linguistics (2010)
22. Poon, H., Domingos, P.: Joint unsupervised coreference resolution with Markov logic. In:
Proceedings of the Conference on Empirical Methods in Natural Language Processing
(2008)
23. Qazvinian, V., Radev, D.R.: Scientific Paper Summarization Using Citation Summary
Networks. In: Proceedings of COLING 2008, vol. 1, pp. 689–696 (2008)
24. Richardson, M., Domingos, P.: Markov Logic Networks. Machine Learning 62(1-2), 107–136 (2006)
25. Shahaf, D., Guestrin, C., Horvitz, E.: Metro Maps of Science. In: Proceedings of the 18th
ACM SIGKDD (2012)
26. Singla, P., Domingos, P.: Entity resolution with markov logic. In: Proceedings of ICDM
2006 (2006)
27. Singla, P., Kautz, H., Luo, J.: Discovery of social relationships in consumer photo collec-
tions using Markov Logic. In: Workshops of CVPRW 2008 (2008)
28. Sugiyama, K., Kan, M.Y.: Scholarly paper recommendation via user’s recent research interests. In: JCDL (2010)
29. Kondo, T., Nanba, H., Takezawa, T., Okumura, M.: Technical Trend Analysis by
Analyzing Research Papers’ Titles. In: Vetulani, Z. (ed.) LTC 2009. LNCS, vol. 6562,
pp. 512–521. Springer, Heidelberg (2011)
30. Wan, X.J., Xiao, J.G.: Single document keyphrase extraction using neighborhood know-
ledge. In: Proceedings of AAAI 2008 (2008)
31. Yeloglu, O., Milios, E., Zincir-Heywood, N.: Multi-document Summarization of Scientific
Corpora. In: SAC (2011)
32. Zhu, J., Nie, Z., Liu, X., Zhang, B.: StatSnowball: a statistical approach to extracting entity
relationships. In: Proceedings of 18th WWW Conference (2009)