
arXiv:1805.04798v1 [cs.DL] 12 May 2018

Citation Data-set for Machine Learning Citation Styles and Entity Extraction from Citation Strings

Niall Martin Ryan


B.A.(Mod.) Computer Science
Final Year Project April 2018
Supervisor: Prof Dr Joeran Beel
Co-Supervisor: Dr Dominika Tkaczyk

School of Computer Science and Statistics


O'Reilly Institute, Trinity College, Dublin 2, Ireland
Declaration
I hereby declare that this project is entirely my own work and that it has not been
submitted as an exercise for a degree at this or any other university.

Signed:

Date:

Niall Martin Ryan

Abstract

Citation Data-set for Machine Learning Citation Styles and Entity Extraction from Citation Strings

by Niall Martin Ryan

Citation parsing is fundamental for search engines within academia and the protection
of intellectual property. Meticulous extraction is further needed when evaluating the
similarity of documents and calculating their citation impact. Citation parsing involves
the identification and dissection of citation strings into their bibliographic components,
such as "Author", "Volume", "Title", etc. This meta-data can be difficult to acquire
accurately due to the thousands of different styles, and the noise, that can be applied to a
bibliography to create the citation string. Many approaches already exist to accomplish
accurate parsing of citation strings. This dissertation describes the creation of a large
data-set which can be used to aid in the training of those approaches that have limited
data. It also describes an investigation into whether the shortcomings of these approaches
to citation parsing, and in particular of the machine learning based approaches, stem
from the limited size of the data used to train them.
Acknowledgements
Firstly, I would like to thank my supervisor Jöran Beel. I could not have done this
without his guidance, expertise, assistance and generous availability for queries and
meetings alike.

I’d also like to thank Dominika Tkaczyk for her help throughout this project, especially
for insight and experience within the field.

I would like to thank my friend Owen Mooney for helping me get the project off the
ground with his advice about scraping.

Thank you to Michel Krämer for answering my queries about the software he has
developed for BIBTEX/citation manipulation.

I thank Conor Fulham for his immaculate proof reading and for carrying me on tough
days.

Finally, I would like to thank my family for their constant support and tolerance of
me over the duration of this project. Especially my father Declan Ryan for his helpful
planning and late night discussions.

Contents

Declaration of Authorship
Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Research Problem
  1.3 Research Goal

2 State of the Art
  2.0.1 Pre-Processing of documents

3 Methodology
  3.1 Introduction to Scraping
    3.1.1 Security and Reliable Sources without inaccuracies
  3.2 Finding efficiency
    3.2.1 DBLP
  3.3 Scraping under the radar
  3.4 Maintenance
  3.5 Parsing & Removal of Impurities
    3.5.1 The 'HomePage' problem

4 Conversion of BibTeX & Building the Data-Set
  4.1 Citation Styles
  4.2 Annotating Citation References
  4.3 CiteProc Software & Efficiency Challenges
  4.4 Structure of the Data-set
  4.5 Meta-Data Field consistency & Noise
  4.6 Results

5 Future Work & Data-Set Usability
  5.1 The Process of Evaluation Testing
    5.1.1 Comparison Method
  5.2 Outlook
  5.3 Conclusion

A Terms

B Information on BibTeX
List of Figures

1.1 Example of Reference
1.2 Example BIBTEX Entry
1.3 Design Schema

2.1 Simple Example of Hidden Markov Model
2.2 Shows Layout of a CRF System

3.1 Shows number of lines scraped over 2-3 day interval data points
3.2 Google Captcha page upon requesting too fast
3.3 Examples of link parameters and the iterative IDs associated
3.4 Shows progression of lines gathered for each individual scraper
3.5 Shows progression of lines gathered using scrapers, normalized by the most efficient data point

4.1 Number of Fields in BIBTEX entries, distributed over the different scrapers
4.2 Number of Types of BIBTEX entries, distributed over the scrapers

5.1 Shows the dissection and pattern matching of a typical research article
List of Tables

4.1 Number of Fields in BIBTEX Entries for each Scraper
4.2 Types of BIBTEX Entries for each Scraper
Abbreviations

AI Artificial Intelligence
CAPTCHA Completely Automated Public Turing test to tell Computers and Humans Apart
CRF Conditional Random Field
CSL Citation Style Language
DDOS Distributed Denial Of Service
DOS Denial Of Service
HMM Hidden Markov Model
IP Internet Protocol
RAM Random Access Memory
RID Random Incremental Delay
SVM Support Vector Machines
TD Threshold Delay
XML eXtensible Markup Language

I dedicate this to my family. Declan, Caroline and Aidan for their
love and support

Chapter 1

Introduction

In the past decade there has been an explosion in the amount of scientific publications
accessible on many different library websites. In order to search and use these libraries
effectively there is an amplifying demand for accurate organization. Organizing these
articles by using the meta-data associated with them allows for more efficient searching,
calculation of citation impact and recommendation of similar articles [1]. Many of the
most accurate citation parsers or meta-data extraction approaches rely on machine
learning strategies and systems [2]. The overall accuracy of machine learning based
approaches depends heavily on large amounts of diverse training data. In Tkaczyk et
al [2], a data-set of nearly 400,000 unique reference strings is used to compare 'out of the
box' tools against trained versions of the same tools. Their results found increases of 2-4%
for already very accurate tools and up to 16% for the weaker 'out of the box' tools. In
subsequent research they are looking to use other, larger data-sets, which are lacking in
this area. Cermine [3-5] is an open source meta-data extraction system which aims to deal
with issues in this area, such as the large number of possible layouts and styles that
can be used, as well as the complications that the PDF format brings by not preserving
delicate information within articles, particularly the document's structure and layout,
which impede meta-data extraction. This information usually needs to be reverse
engineered for it to be recovered. They used a data-set referred to as GROTOAP2 [6]
which contained "13,210 ground truth files" within its sample. It also contained errors
because of non-manual preparation/altering, which meant it had both segmentation and
labelling errors. Cermine combined three small open source data sets for its reference
parsing evaluation: CiteSeer [7], Cora-ref [8] and open source PubMed [9] references.
Cora-ref contained a total of 50,000 collected papers. These are not only out of date, but
rather small in comparison to the data-sets that exist in other areas of machine learning,
which are truly needed in order to obtain accurate results. ParsCit [10] is another machine
learning based citation parser whose authors plan to expand their successful research into


larger training data-sets. ParsCit is currently implemented in a digital library called
CiteSeerX. The difficulty of reference parsing, and the reason this data-set is needed in
this area, is firstly the sheer number of different styles used within the academic
community, which spans into the thousands. Secondly, because of human error and
noise created by automated sources, a rule/template approach cannot account for these
kinds of situations. A machine learning based approach is better suited to learning and
adapting to the discrepancies within large data-sets.

This project aims to help and improve on current meta-data extraction techniques by
creating a very large reference data-set on which machine learning based citation parsing
tools can be trained.


This shows examples of just a few stylings for the same reference

Argon C. & McLaughlin S. W. 2002. A parallel decoder for low latency decoding
of turbo product codes. IEEE Communications Letters 6: 70–72.

Argon C, McLaughlin SW. 2002. A parallel decoder for low latency decoding of
turbo product codes. IEEE Communications Letters. 6(2):70–72

Argon, C. and McLaughlin, S. W. (2002), 'A parallel decoder for low latency
decoding of turbo product codes', IEEE Communications Letters, 6:2, pp.
70–72. https://doi.org/10.1109/4234.984698

1. Argon C, McLaughlin SW. A parallel decoder for low latency decoding of
turbo product codes. IEEE Communications Letters 2002; 6: 70–2. Available
at: https://doi.org/10.1109/4234.984698.

Argon, Cenk, and Steven W. McLaughlin 2002 A parallel decoder for low latency
decoding of turbo product codes. IEEE Communications Letters 6(2). IEEE
Communications Letters: 70–72. Retrieved from
https://doi.org/10.1109/4234.984698.

ARGON, Cenk; MCLAUGHLIN, Steven W. A parallel decoder for low latency decoding
of turbo product codes. IEEE Communications Letters, IEEE Communications
Letters. v. 6, n. 2, p. 70–72, 2002. Disponível em:
<https://doi.org/10.1109/4234.984698>.

1. Argon C, McLaughlin SW. A parallel decoder for low latency decoding of
turbo product codes. 2002;6:70–2.

Argon C., McLaughlin S.W., 2002. A parallel decoder for low latency decoding
of turbo product codes. IEEE Communications Letters, 6 (2): 70–72, doi:
10.1109/4234.984698

Figure 1 - A single reference shown in different example citation styles

1.1 Motivation

Accurate citation parsing is a necessity within academic search engines and for the
security of intellectual property. Automated extraction of bibliographic data, such as
article titles, author names, abstracts, and references, is essential to the affordable creation
of large citation databases [1]. It aids in the identification of correlated documents and in
evaluating the citation impact of publications by researchers and scholars. For more
effective research methods in this area, categorization and the highlighting of articles with
strong correlation is crucial. Fedoryszak et al [11, 12] reason that parsing is generally
the first step before citing documents are identified within a collection, breaking the
process up into two phases: segmentation and entity resolution. In another paper they describe

Figure 1.1: Example of Reference

"[0] - Galli, Luca and Fraternali, Piero and Bozzon, Alessandro, On the
Application of Game Mechanics in Information Retrieval, in Proceedings of
the GamifIR 14, 2014."

how parsing is pivotal to a 'matcher's' effectiveness. The parser attempts to identify
fragments of raw citation strings which contain substantial information such as author,
year and source.

Citation parsing refers to identifying and extracting a reference from the full text and
classifying its elements into fields such as author, journal, publication year, etc. from
the bibliography. To offer an example of a citation parser's function: the parser may
identify the citation marker [2] first, but in this area it may simply begin with extraction
of the bibliography, extracting from its first entry 'Galli', 'Luca' and 'Fraternali', which
are the authors from Figure 1.1. Figure 1.2 is an example of one such BIBTEX sample
which provides meta-data about the document. Many applications feature document
retrieval elements, such as recommendation systems and search engines. Recommendation
systems depend immensely upon the accuracy of references and reference parsing. Beel
et al [1, 13] describe the need for open source, massive access to citation data and full
scholarly articles. This relies heavily on accurate references to resolve "dis-ambiguity of
author names" as well as to identify duplicate papers. Mr. DLib, an open source,
non-profit recommendation system offered as a product, describes recommender systems
in academia as helping "researchers and scientists overcome information overload" [14].
Recommendation systems depend on accurate citation parsing because of the need to
calculate citation proximity, to make accurate connections without confusing reference
meta-data, and to maximize the quantity of documents recommended. "Citations play
important roles for both the rhetorical structure and the semantic content of the articles,
and as such, citation information has shown to benefit many text mining tasks including
information retrieval, information extraction, summarization, and question answering" [15].

1.2 Research Problem

Over the years many approaches to reference parsing have been proposed, including
regular expressions, knowledge-based approaches and supervised machine learning.
Machine learning-based solutions, in particular those falling into the category of supervised
sequence tagging, are considered a state-of-the-art technique for reference parsing. They


Figure 1.2: Example BIBTEX Entry

still suffer from two issues: the lack of sufficiently big and diverse data sets and
problems with generalization to unseen reference formats. In particular, the need for an
'open source', 'massive' data set is the main subject of this research paper. Cora is one
example of a small, outdated data set which is still used in recent years [8], containing
50,000 PDFs. CiteSeerX, Wu et al [16], currently contains over 4 million documents
containing citations, a moderate size, but it still needs to be parsed and styled in order to
train an ML algorithm. An 'out-of-the-box', massive, open source training data set is
crucial for the improvement of citation parsing tools based on CRF machine learning
approaches. Lipinski et al [17] use arXiv as their data set for training and evaluation.
An extremely small set was used for evaluation testing: "we obtained 1,153 random PDF
articles including their metadata, dated from 2006 to 2010". Yu et al [18] used the Cora
data set as well as one by Han; they used 350 "reference items" selected randomly for
training and 150 for evaluation testing. [19] mentions a manually built database, the
"Institute for Scientific Information", and also that of CiteSeer, using 2,750 citation
strings as training data. Zhang created a database by automated scraping of PubMed
journals, randomly selecting 2% from each journal so that it was diverse. 672 articles
were acquired, which led to a total of 27,606 citation strings. This is a rather small data
set with impressive results. A last notable comment is made about supervised machine
learning algorithms being "generalizable and the learning framework is robust, as long as
training data is made available" [19], which comes back to the idea Cameron spoke about,
in which free open source data sets and libraries should be made available. Lin et al
[20] obtain an initial 260 articles from PubMed, which were subsequently reduced to 93
that were manually annotated and used for training and evaluation. In Pinto et al
[21] they use CRFs and refer to how they use sums which contain a "gaussian prior over
parameters" and variance in order to help with the scarcity of training data available.
Peng et al (2013) [22] were still using benchmark data sets that were small and out
of date, compensating in other ways to make up for the lack of data; they happen to


also use CRFs in their approach. This quote sums up the situation of small datasets:
"This is particularly important for the scientific publishing domain, where currently
most of the datasets are being created in an author-driven, manual manner" [23]. They
also describe the influence of the Cora data-set at the time: "The CORA dataset is the
first gold standard created for automatic reference chunking. It comprises two hundred
reference strings and focuses on the computer science area" [23].

Papers currently evaluating and improving upon CRF based tools are using data sets
which are not publicly available [2]. It would appear that those data sets are being used
commercially, which may or may not benefit the progression of technology in this area.
Google Scholar is probably the most prolific source for easy access to individual BIBTEX
entries, but it remains under lock and key, as could be expected.

For deep learning, data-sets much larger than those that exist today would be needed to
train a sophisticated model. Early research in this area used rule-based applications,
loosely based around the expert knowledge of someone in a specific domain. Patterns
tend to emerge, which the expert can then formulate into a method for parsing such
strings. Rule-based methods have some drawbacks and only really succeed in specific,
small to moderate journal areas, because journal publishers regularly specify which
predefined citation styles are to be used. These rule-based methods are also less adaptable
and are difficult to optimize under these circumstances.

1.3 Research Goal

The research goal is to build a massive, open source data set which can be used for
training high-end machine learning citation parsing tools and evaluating their performance.
This goal can be broken into gathering BIBTEX entries, generating citations in thousands
of different styles, and annotating the citations. The BIBTEX entries can be scraped from
many different electronic libraries and recommender systems. These BIBTEX entries are
the meta-data we plan to retrieve once the citation parsing tools are finished with the
references. With reference to Figure 1.3, annotated and standard references are needed
in order to train the machine learning based tools. These references are obtained by way
of CiteProc [24], which takes in a CSL template file and a BIBTEX entry and returns the
output of a reference string. Each BIBTEX entry will be converted into thousands of
stylized references. The data set can then be used to evaluate citation parsing tools using
the annotated and standard references.


Figure 1.3: Design Schema

Figure 1.3 - Project Schema layout - BIBTEX entries are scraped from the internet. The
CiteProc Converter takes in these BIBTEX entries along with a standard CSL file to output
a reference, and also an altered CSL file which will give its output as an annotated


reference. Both of these outputs are stored in the data-set alongside the original scraped
BIBTEX. The data-set is then passed on for evaluation testing.

Chapter 2

State of the Art

Numerous methodologies have been used in research articles to solve the generalized
problem of meta-data extraction from reference strings. One of these is "A structural
SVM approach for reference parsing" [1]. This approach involves using structural support
vector machines, a supervised machine learning method used for predicting structured
outputs. The feature extraction featured in "A structural SVM approach for reference
parsing" involves preliminary processing and splitting the reference into individual word
tokens based on criteria [15]. They use these features to identify different entities that
pattern match to their outputs. A second method proposes to avoid content based
analysis due to its unreliability and instead proposes a Web-based CME approach and a
citation enriching system, called BibAll, which is capable of correcting the parsing results
of content-based CME methods and augmenting citation meta-data by leveraging relevant
bibliographic data from digital repositories and cited-by publications on the Web [25].
CiteSeer is a citation indexing system that refers to a large, freely viewable database of
scholarly articles with citation and bibliography meta-data which updates automatically;
this idea is referenced from a paper Cameron wrote on the topic [26]. "Digital libraries
and autonomous citation indexing" created a way in which authors did not have to
expend their energies to see their work in the database [27]. It instead grabs the papers
from the web in PDF format, grabs the citation and its context, then uploads it to a
database in order to advance the ease of searching for literature and its assessment.
Another system, proposed in 2007, is unsupervised and does not require or rely on
patterns encoding specific delimiters of a particular citation style [28]. The model they
propose involves four steps: splitting the string into its different element blocks; matching
an associated meta-data field with an element block; taking unmatched element blocks
and examining them for further associations based on their position within the citation;
and collecting all of the element blocks and chaining them together when they are
associated with the same meta-data

field. Their results were as follows: "the extraction quality obtained by FLUXCiM, even
without user intervention, reached F-Measure levels above 92%" [28].

Template matching approaches use syntactic pattern matching against a knowledge
base of known templates. The citation is put against these templates to find the closest
match; upon success, the best-fit template is used to label the tokenized version of
the citation as fields [10]. An authoritative example of this in action can be seen in
ParaTools [29]. These Perl modules hold an estimated 400 templates which are used to
match against reference strings. These templates may encompass a large selection of
references, but this manifests scope problems where not all types of references are covered
by the templates. This approach does not scale effectively due to the exercise of
adding templates manually.

Yin et al [30] use 'Bigram' HMMs to perform meta-data extraction. A HMM
is a statistical Markov model, a finite state machine containing appropriate state
transitions and symbol outputs/emissions. The process involves the production of a string
of symbols from a starting state, transitioning to other states and outputting symbols
dictated by the current state of the model until a concluding state is reached. Probability
distributions are associated with each state over the symbols in its emission set, along
with a probability distribution over the outgoing transitions. A reference can be viewed
as an arrangement of fields such as 'Author', 'Title', etc. For a HMM, every state can be
marked with a label associated with one of the fields mentioned above. The extraction
process proceeds by determining the arrangement of states that is most probable for
generating the complete reference and then assigning the corresponding labels to their
fields according to the sequence of the states. Yin et al [30] use a "dynamic programming
solution called the viterbi algorithm" which solves the problem in O(TN^2) time. Another
HMM approach, by Hetzner et al [31], is similar but without the use of 'Bigrams'.
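
To make the O(TN^2) claim concrete, the following is a minimal sketch of the Viterbi
algorithm for such a field-labelling HMM. The states, symbols and probabilities below are
hypothetical toy values chosen for illustration, not the model used by Yin et al.

# Minimal Viterbi sketch for an HMM-based field labeller (toy probabilities).
# States are the field labels; observations are word tokens.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observation list."""
    # V[t][s] = best probability of any path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 1e-12) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # O(N) work per state per token -> O(T * N^2) overall
            prev, prob = max(
                ((p, V[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda x: x[1],
            )
            V[t][s] = prob * emit_p[s].get(obs[t], 1e-12)
            back[t][s] = prev
    # Trace back from the best final state to recover the label sequence
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        last = back[t][last]
        path.insert(0, last)
    return path

# Toy example: label a three-token "reference" as Author / Title / Year.
states = ["Author", "Title", "Year"]
start_p = {"Author": 0.8, "Title": 0.15, "Year": 0.05}
trans_p = {
    "Author": {"Author": 0.4, "Title": 0.5, "Year": 0.1},
    "Title": {"Author": 0.05, "Title": 0.6, "Year": 0.35},
    "Year": {"Author": 0.1, "Title": 0.1, "Year": 0.8},
}
emit_p = {
    "Author": {"Argon": 0.5, "decoder": 0.01, "2002": 0.01},
    "Title": {"Argon": 0.05, "decoder": 0.4, "2002": 0.02},
    "Year": {"Argon": 0.01, "decoder": 0.01, "2002": 0.9},
}
print(viterbi(["Argon", "decoder", "2002"], states, start_p, trans_p, emit_p))
# -> ['Author', 'Title', 'Year']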

A simple example can be seen in figure 2.1.

Within ParsCit [10], they use what is known as a conditional random field (CRF) [32]
model, learned from an annotated reference data-set, that can be applied to hidden or
otherwise unknown references. The ParsCit system works relatively well on small
evaluation data sets with good accuracy. This model of learning allows for adequate
scaling in regards to 'conditionally dependent features' that may overlap. They use a
variety of manually engineered features to improve on error-prone past systems. A few
of these features from ParsCit [10] are listed below, with a small sketch of comparable
feature extraction after the list.


Figure 2.1: Simple Example of Hidden Markov Model

Figure (2.1) shows a simple Hidden Markov Model showing the different states with
their attached probabilities for each transition as well as the outputs that are possible
from each [33].

Token identity (3): We encode separate features for each token in three different forms:
1) as-is, 2) lowercased, and 3) lowercased stripped of punctuation.

N-gram prefix/suffix (9): We encode 4 features for the first 1-4 characters of the to-
ken, similarly for the last 1-4 characters. A single feature also examines the last
character of the token, encoding whether it is uppercase, lowercase or numeric.

Orthographic case (1): We analyze the case of the token, assigning it one of four values:
Initialcaps, MixedCaps, ALLCAPS, or others.

Punctuation (1): Similarly, we give fine-grained distinctions for the punctuation present
in token: leadingQuotes, endingQuotes, multipleHyphens (occasionally found in
page ranges), continuingPunctuation (e.g., commas, semicolons), stopPunctuation
(e.g., periods, double quotes), pairedBraces, possibleVolume (e.g., 3(4)), or others.
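
As an illustration only, and not ParsCit's actual code, a minimal sketch of extracting a
few of the surface features listed above might look as follows; the feature names are my
own assumptions.

import re
import string

def token_features(token):
    """Compute a handful of ParsCit-style surface features for one token."""
    stripped = token.strip(string.punctuation)
    last_char = token[-1]
    if last_char.isdigit():
        last_kind = "numeric"
    elif last_char.isupper():
        last_kind = "uppercase"
    elif last_char.islower():
        last_kind = "lowercase"
    else:
        last_kind = "other"
    if token.isupper():
        case = "ALLCAPS"
    elif token[:1].isupper() and token[1:].islower():
        case = "InitialCaps"
    elif any(c.isupper() for c in token[1:]):
        case = "MixedCaps"
    else:
        case = "others"
    return {
        "as_is": token,                        # token identity, three forms
        "lower": token.lower(),
        "lower_stripped": stripped.lower(),
        "prefix_1_to_4": [token[:n] for n in range(1, 5)],   # n-gram prefixes
        "suffix_1_to_4": [token[-n:] for n in range(1, 5)],  # n-gram suffixes
        "last_char": last_kind,
        "case": case,
        "possible_volume": bool(re.fullmatch(r"\d+\(\d+\)", token)),
    }

print(token_features("3(4)"))
print(token_features("IEEE"))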

CRF is by and large the most effective and accurate method of meta-data extraction,
with GROBID [34] having extremely strong results in precision and recall: "GROBID
provided the best results over seven existing systems, with several metadata recognized
with over 90% precision and recall". In [2], the most accurate and best performing


citation parsers were all in the CRF category, using small to medium sized data sets in
the range of 200,000-420,000 reference strings. "We leverage higher order semi-Markov
conditional random fields to model long-distance label sequences, improving upon the
performance of the linear-chain conditional random field model." [35]

[34–36]

2.0.1 Pre-Processing of documents

Pre-processing is important because of the crucial need for accuracy when calculating
citation density, finding connections and making recommendations. It is needed because
many databases contain errors, caused by human error when citing and creating references,
and also by formatting issues which do not maintain the consistency of a document's
different sections, leaving citation parsers bewildered. A large, public data set is needed
because the current open source data sets seem to be error prone in several ways, and
the need for accurate, precise data is growing as the libraries and the web of citation
systems grow exponentially. The normal setting for extracting these reference strings is
from document formats such as PDF. PDFs can introduce a lot of noise into references
if a reference contains special characters or escape characters before being added.
Finding these references is the first step in extracting them. "Given a plain UTF-8 text
file, ParsCit finds the reference strings using a set of heuristics. It begins by searching
for a labeled reference section in the text. Labels may include such strings as References,
Bibliography, References and Notes, or common variations of those strings. Text is
iteratively split around strings that appear to be reference section labels" [10]. A lower
bound is put on where a reference section label can appear before it is considered to be
accurate; this criterion is set to around 40% by default. The last match of a reference
section label is treated as the reference section to analyze. Reference strings are then
extracted based on heuristics and citation markers such as square brackets or parentheses.


Figure 2.2: Shows Layout of a CRF System

Figure (2.2) Shows a complicated citation parsing system using conditional random
fields [25].


In a supervised machine learning-based approach, reference parsing is more formally
expressed as a sequence tagging problem. In this type of problem, the input consists
of a sequence of features, and the objective is to assign a corresponding chain of labels.
Not only the features are taken into account, but also the dependencies between adjoining
labels within the chain. The input reference string must first be tokenized, which involves
fragmentation; each fragment is generally referred to as a token. Upon successful
fragmentation, each token is assigned a label by the sequence tagger. Each label assigned
usually coincides with a desired meta-data field or another special label. Tokens with
similar labels are then concatenated to form the finished meta-data fields.
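
As a purely illustrative example of this token-to-label view (the tokens and label names
here are hypothetical, not those of any particular tool):

# One tokenized reference and the label chain a sequence tagger might assign.
tokens = ["Argon", ",", "C.", "A", "parallel", "decoder", "...", "2002", "."]
labels = ["author", "author", "author", "title", "title", "title", "title", "year", "other"]

# Tokens sharing a label are grouped into meta-data fields
# (adjacency is ignored in this toy example).
fields = {}
for token, label in zip(tokens, labels):
    fields.setdefault(label, []).append(token)
print({label: " ".join(toks) for label, toks in fields.items()})
# -> {'author': 'Argon , C.', 'title': 'A parallel decoder ...', 'year': '2002', 'other': '.'}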

CERMINE, Tkaczyk et al (2015), is another machine learning system, using both
supervised and unsupervised methods, for comprehensive meta-data extraction. "The
evaluation of the extraction work flow carried out with the use of a large data set showed
good performance for most meta data types, with the average F score of 77.5%" [37].

Machine learning citation parsers perform better on average. This difference is largely
due to recall. "The average recall for ML-based tools (0.66) is three times as high as
non-ML-based tools (0.22). At the same time, the difference in average precisions is
small (0.77 for ML-based tools vs. 0.76 for non-ML-based tools). The reason for this
might be that it is relatively easy to achieve good precision of manually developed rules
and regular expression, but it is difficult to have a high enough number of rules, covering
all possible reference styles." (Tkaczyk et al [2])

The reason a bigger data-set is needed for training these machine learning based
approaches, instead of trying to find better algorithms or better machine learning models,
is the prevailing trend that more data is always better. A famous quote from Peter Norvig,
a research director at Google, asserts "We don't have better algorithms. We just have
more data.". Of course this is not always necessarily true, but in general more data rarely
has a negative impact on results. Halevy et al (2009) [38] describe how human complexities
cannot be extrapolated through mathematical equations, and argue instead for accepting
those complexities and using whatever data is useful and available. The data-sets
currently available, such as the Cora data-set, are of a reasonable size but nothing that
could be considered massive [39]. Notably, [2] found outstanding results showing major
increases in the extraction accuracy of significant fields when using reasonably sized
data-sets in the range of 100,000 to 500,000 unique references. These results indicate
that by increasing the size of these data-sets to be massive, the performance of the models
will also increase, be it a small amount of 2% for well-trained algorithms or much higher
percentages, in the high 20s, for weaker ones. It will also answer the question of whether
a threshold will be revealed on the performance of these models, which would


indicate that better algorithms are the way forward in this area, or whether no upper
bound exists on this problem and more data will simply continue to make these citation
parsers better.

Chapter 3

Methodology

3.1 Introduction to Scraping

After trudging through the internet looking for data-sets that matched my criteria and
coming up empty, automated scraping seemed like the most effective solution, considering
that many sites offer articles in BibTeX format at the click of a button. Scraping involves
requesting HTML pages/files from servers, and usually involves making many requests to
the same web site in order to gather as much data as possible. If the frequency of requests
gets high enough to interrupt the service of the website, this becomes a denial of service
attack, which is illegal. A lot of systems are in place to stop possible DDOS'ing, and we
need to avoid those systems in order to allow our scraper to continue collecting for long
periods. 99% of the time the system knows you are a bot and that you are scraping, but
if you are "bashful" enough, it will not be worth the server's time to IP ban you, as you
are not worth their resources. Being "bashful" means using quite slow parameters for the
scraper, i.e. waiting a rather long time between requests and using different media devices
in the headers of each request, so that it is not worth the site's time to ban you. There
are a lot of pre-built scrapers for specific sites that could be used, but my needs were
rather particular, so I built my own scrapers in Python. Scraping efficiently while not
getting caught is a difficult balancing act, and many elements affect the speed at which
you collect.

3.1.1 Security and Reliable Sources without inaccuracies

All of the data scraped was freely available to the public, so I did not need to consider
data protection laws when porting it to a database. A main concern behind this
investigation was making sure the data collected was of a good standard, from reliable
sources, so that

Figure 3.1: Shows number of lines scraped over 2-3 day interval data points

the data did not contain errors or artificial entries. To achieve this goal of clean data,
any time I came across a suitable library I would manually check the accuracy of its
BIBTEX files through 10 trial runs, which involved inspecting the fields for errors in the
text, such as syntax problems and noise. These were compared against other libraries
containing the same article to get a precise measure of their accuracy. These points are
important to the overall goal of supplying a large, open source data set, because if the
data retains inaccuracies it will not be useful when trying to train these parsing tools.

3.2 Finding efficiency

Google Scholar is an excellent resource for finding articles that suit your research tastes
or that relate to other topics through citations. This is why scraping it seemed ideal for
my data-set needs. Upon further investigation, however, its citation counts turned out
to be susceptible to manipulation by authors and spam alike [40, 41]. I also underestimated
Google's ability to track suspicious requests and block them. The structure in which the
data is stored on the site also meant that retrieving the BIBTEX would take at least 3
requests for 1 returned entry. This just

Figure 3.2: Google Captcha page upon requesting too fast

made it impossible to scrape even when being 'bashful'. In a test scrape of Google Scholar
I was served the CAPTCHA page shown in Figure 3.2 within 37 entries collected, which
is overwhelmingly impressive. ACM gets an honourable mention, as it did indeed block
my IP address on occasion.

Captchas are a great way of checking for bots, by giving the requester a question that is
not automatically readable, usually images containing obscure-looking characters,
although I am sure there are machine learning models out there to solve these problems.
An important aspect of the search for fast requests was quite basic but tedious: how do
I make sure I am getting at least one entry per request? To take the Google Scholar
example once more, the only way that I could iteratively and uniformly scrape entries
was through 3 requests, 2 of which were for information which supplied the third. This
was a ridiculous amount of overhead in contrast to the return I was receiving. This led
to the search for efficiency through 1 request per entry or better, which subsequently
involved looking for hidden links/URLs which were iterative and did not need more than
1 request. By iterative I mean that the link itself contains an identification number for
each entry that can be iterated through with requests. If the IDs are random instead of
uniform, this may cause the scraping of undesirable data.

These are some examples of what I referred to as 'hidden links', so called only because
finding them involved opening up developer tools in a web browser and following the
network sequence of requests and headers, which needed to be analyzed to find the correct
one. '@' symbols indicate that a descriptive placeholder is in the example instead of a
concrete value.


Figure 3.3: Examples of link parameters and the iterative IDs associated

ACM
https://dl.acm.org/downformats.cfm?id=965498&parent_id=&expformat=bibtex

CCS
https://liinwww.ira.uka.de/csbib/ + query parameters ['query', 'results', 'sort', 'start', 'maxnum']

TeXMed
http://www.bioinformatics.org/texmed/cgi-bin/list.cgi?PMID=29168941

DBLP
dblp.uni-trier.de/pers/tb2/@FirstNameBeforeComma:@NameAfterCommaSeparatedByUnderscores.bib

MS
https://academic.microsoft.com/cite/Bibtexs?pID=1577993811

IEEE
http://ieeexplore.ieee.org/xpl/downloadCitations?citations-format=citation-only&download-format=download-bibtex&recordIds=8110474
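
A minimal sketch of scraping such an iterative link is shown below; the endpoint and
parameter names are placeholders modelled on the ACM-style export link above, not a
guaranteed working URL.

import time
import requests

def scrape_by_id(base_url, start_id, end_id, out_path, delay=5.0):
    """Fetch one BibTeX entry per request by iterating over record IDs."""
    with open(out_path, "a", encoding="utf-8") as out:
        for record_id in range(start_id, end_id):
            response = requests.get(base_url, params={"id": record_id,
                                                      "expformat": "bibtex"})
            if response.ok and response.text.strip().startswith("@"):
                out.write(response.text.strip() + "\n\n")
            time.sleep(delay)  # stay "bashful": constant delay between requests

# Hypothetical usage, following the ACM-style export link shown above:
# scrape_by_id("https://dl.acm.org/downformats.cfm", 965000, 966000, "acm.bib")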

3.2.1 DBLP

Since most of the scrapers relied on IDs for each of their BIBTEX entries, I was on
average getting 1 BIBTEX entry per request, which is good depending on the speed of
your requests and whether you can cut the time down by running things in parallel or
concurrently. DBLP, which is a computer science bibliography library, had the same
indexing system, but I needed results faster than the '1 for 1' that the other scrapers
were achieving, mainly due to time constraints on the project. After some digging, I
found that if you got an author's details you could download their articles as BIBTEX
or scrape them from a view they provided. This increased the productivity and efficiency
of the scrapers by quite a large margin, since it was now 1 to N returns per request.

In figure 3.4 and figure 3.5, it is clear that the two correlate and explain each other rather
well. The high starting efficiencies in 3.5 are because of the output of CCS, which had
rather effective requesting and a relatively low request delay. This eventually drops off
because that scraper finished incredibly fast and had nothing else to collect. Then, as
soon as DBLP begins, the efficiencies go up to a steady amount until the end of the
entire data gathering section.


Figure 3.4: Shows progression of lines gathered for each individual scraper

Figure 3.5: Shows progression of lines gathered using scrapers. It is also normalized
by the most efficient data point.


3.3 Scraping under the radar

Automated scraping using your own 'bot' violates almost every single search engine's
terms of service, especially if your bot does not even look at their robots.txt file, which
outlines the terms and rules that bots must follow when using the site. "The search
engines aren't naive. Google knows every search engine operator scrapes its results. So
does Bing. Both engines have to decide when to block a scraper" [42]. After some testing,
the decision of whether or not to block a scraper comes down to two factors:

• The potential load put on the server by the scraper

• The potential load put on the server because it needs to block the scraper

[42] It basically comes down to being on the right side of these two parameters. Even if
you are inevitably found out as a bot, your scraper should be so unobtrusive that it is
not worth the site's time to waste CPU cycles on you. Humans do not click on different
links within milliseconds of each other; such speed is simply unfeasible for a person.
Making the scrapers as human as possible increases the probability of not getting stopped
or IP banned. The two main tunable parameters that each of my scrapers used to
disguise themselves are TD, the 'Threshold Delay', and RID, the 'Random Incremental
Delay'. TD is a constant delay that stops the scraper from making requests too quickly
in succession. RID gives the scraper random elements of time delay, to create the illusion
of a human. These parameters can be tuned more aggressively if you have a server which
can dynamically change its IP address, a tunneled VPN, or some other way of avoiding a
ban if you get caught too many times. The tuning process involves taking in scraper
information, looking at the logs of your scrapers' interactions and altering the appropriate
parameter within the scraper. Furthermore, using a random user agent from a large array
of devices within the header of each request can drastically increase your probability of
going undetected.
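
As a minimal sketch of these two delay parameters and the rotating user agent, the
specific values and agent strings below are illustrative assumptions, not the exact settings
used in my scrapers.

import random
import time
import requests

THRESHOLD_DELAY = 8.0          # TD: constant minimum wait between requests
RANDOM_INCREMENT_MAX = 7.0     # RID: extra random wait added on top of TD

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4)",
    "Mozilla/5.0 (Linux; Android 8.0; Pixel 2)",
]

def polite_get(url, params=None):
    """Issue one request disguised with a random user agent, then sleep
    for TD plus a random incremental delay before returning."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, params=params, headers=headers)
    time.sleep(THRESHOLD_DELAY + random.uniform(0.0, RANDOM_INCREMENT_MAX))
    return response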

3.4 Maintenance

On my server I had upwards of five to seven scrapers running at any one time. This
meant that I was constantly checking their progress and fixing bugs, which would arise
quite often. I implemented a simple tracking system so that if any of the scrapers were
cut short during execution, it would record information on the error, the ID of the last
URL that was accessed and the current number of entries scraped. This allowed for easy
access, debugging and resetting of the scraper to pick up where it left off. With a small
server it can be difficult to organize resources to be as efficient as possible.
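
A minimal sketch of such a checkpoint, assuming a hypothetical JSON layout (the field
names are my own, not those of the original implementation):

import json
import os

CHECKPOINT = "scraper_state.json"

def save_state(last_id, entry_count, error=None):
    """Record the last URL ID reached, the running entry count and any error."""
    with open(CHECKPOINT, "w", encoding="utf-8") as f:
        json.dump({"last_id": last_id, "entries": entry_count,
                   "error": error}, f)

def load_state():
    """Resume from the checkpoint if it exists, else start from scratch."""
    if not os.path.exists(CHECKPOINT):
        return {"last_id": 0, "entries": 0, "error": None}
    with open(CHECKPOINT, encoding="utf-8") as f:
        return json.load(f)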

3.5 Parsing & Removal of Impurities

BIBTEX is a format which is relatively simple to parse in most languages. The issue lies
in entries containing special characters which affect a parser's interpretation of an entry,
which in turn produces errors. There exist many 'Bib cleaners' which attempt to clean
BIBTEX databases by removing duplicates, checking that entries contain the correct
fields for their entry type and calling attention to infractions in the bibliographies which
may affect parsing. Random manual inspection can be useful for getting a 'taste' of what
the data consists of and whether there may be problems with a conversion. A field within
the DBLP scraper data contained syntax that is reserved in BIBTEX entries, which would
affect the conversion when a parser attempted it. Since the field was not one of major
concern, such as 'author' or 'title', I removed the field from every entry it existed in to
alleviate the issue.
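
A minimal sketch of stripping such a field from every entry is shown below, using plain
text processing and a hypothetical field name 'biburl'; the actual offending field is not
named above.

import re

def drop_field(in_path, out_path, field="biburl"):
    """Remove every '<field> = {...},' line from a BibTeX file.
    Assumes one field per line, as produced by the scrapers."""
    pattern = re.compile(r"^\s*" + re.escape(field) + r"\s*=", re.IGNORECASE)
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not pattern.match(line):
                dst.write(line)

# drop_field("dblp.bib", "dblp_clean.bib")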

3.5.1 The ’HomePage’ problem

Within the DBLP scraper data I noticed a large number of similar disparities, which had
accumulated due to the @misc entry type for BIBTEX entries. All BIBTEX entry types
can be seen in Appendix B. Misc in particular is short for miscellaneous and is used when
the type of the entry is not apparent. DBLP contains thousands of entries for people's
homepages in BIBTEX format, which do not contain useful fields for training these
algorithms, especially considering the misc type does not require any fields to be valid.
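
A minimal sketch of filtering out such entries, again using plain text processing and
assuming each entry starts with its '@type{' line (a simplification of the real cleanup):

def drop_misc_entries(in_path, out_path):
    """Copy a BibTeX file, skipping every @misc entry (e.g. DBLP homepages).
    Entries are assumed to begin on lines starting with '@'."""
    keep = True
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.lstrip().startswith("@"):
                keep = not line.lstrip().lower().startswith("@misc")
            if keep:
                dst.write(line)

# drop_misc_entries("dblp.bib", "dblp_no_homepages.bib")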

Chapter 4

Conversion of BibTeX & Building the Data-Set

4.1 Citation Styles

When referencing research papers and other scholarly articles, particular styles are used
depending on the preference of the university or department. This is why citation parsing
for meta-data generation is such a complex issue; of course, if the parser knew which
citation style was being used, it could reverse engineer the reference to get back the
correct fields. This complexity stems from the way in which references are created: the
details of each style are contained in CSL files, which are publicly available at [43]. These
CSL files contain an XML schema for each style that can be used to convert BIBTEX to
a bibliographic reference which can be inserted into a paper. Each BIBTEX entry that
was scraped is used alongside each of the thousand or so CSL styling files to generate the
appropriate bibliographic reference strings, meaning that a single BIBTEX entry within
the database will contain citation information for every style used in the conversion.
These bibliographies are needed so that ground truth values can be obtained when
evaluating the 'out-of-the-box' results; these results will then be contrasted with those
of the trained versions of the tools.

4.2 Annotating Citation References

Annotated citation references are references which contain meta-data tags in XML
format, outlining the fields that belong to the associated reference tokens. In order
to create these annotated references with the correct corresponding fields or labels, the

CSL files need to be altered to add the XML formatted tags which encompass the
different tokens of the reference string. Manually altering over a thousand CSL files to
reflect these changes is unfeasible, which is why I began writing scripts to automate the
task. One of the main problems with automating this process is the need for the tags to
be accurate; since these CSL files are not always uniform in their XML layout, it is quite
difficult to find where to place the tags within them. The scripts copy the appropriate
tag name from whatever the CSL document calls that segment and add it to the correct
position within the XML. This is why consistency between the BIBTEX field labels and
the labels of the annotated references was such an issue.

<author><surname>Shachter</surname> <firstname>R. D.</firstname>,


<surname>Levitt</surname> <firstname>T. S.</firstname>,
<surname>Kanal</surname> <firstname>L. N.</firstname> &
<surname>Lemmer</surname> <firstname>J. F.</firstname> (eds)</author>
<issued> 1990. </issued> <edition>UAI 88: Proceedings of the Fourth Annual
Conference on Uncertainty in Artificial Intelligence, Minneapolis, MN,
USA, July 10-12, 1988</edition><publisher>North-Holland</publisher> .
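
To produce output like the annotated reference above, each rendered variable in the CSL
style can be wrapped in tags named after that variable. The sketch below illustrates the
idea using the CSL prefix/suffix attributes; it is an assumption-laden illustration of the
approach, not the exact script that was used.

import xml.etree.ElementTree as ET

CSL_NS = "http://purl.org/net/xbiblio/csl"

def annotate_csl(in_path, out_path):
    """Wrap every <text variable="..."/> node in a CSL style with
    <variable>...</variable> tags via its prefix/suffix attributes."""
    ET.register_namespace("", CSL_NS)
    tree = ET.parse(in_path)
    for node in tree.iter("{%s}text" % CSL_NS):
        variable = node.get("variable")
        if variable:
            node.set("prefix", node.get("prefix", "") + "<%s>" % variable)
            node.set("suffix", "</%s>" % variable + node.get("suffix", ""))
    tree.write(out_path, xml_declaration=True, encoding="utf-8")

# annotate_csl("apa.csl", "apa-annotated.csl")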

4.3 CiteProc Software & Efficiency Challenges

CiteProc-java is a Citation Style Language (CSL) processor for Java; it interprets and
translates CSL styles and generates citations and bibliographies [24]. CiteProc uses a
CSL processor to make the conversion to a citation. Initializing a processor takes the
style as a parameter, which means that you cannot use the same processor and simply
change the style parameter for each case. This brings about massive efficiency challenges,
as well as questions about the order in which the conversion operations should be carried
out. There are two ways of ordering the loops. One option is to initialize a processor for
every style needed for the conversion and to iterate through each BIBTEX entry object;
for each entry, the processor of each style is used to convert it to its associated
citation/reference, including the altered annotated CSL files. Although it is expensive to
initialize all of the processors at the beginning, and each processor takes up a large chunk
of RAM, with this ordering the same object is updated with the citation values, so it will
most likely be cached, and it will be faster than continually pulling a new object from
the database for each style insertion. The other approach is, for each style, to iterate
through each object and make the conversion. With this approach, each style processor
may still be expensive to initialize, but the overall amount of RAM used at any


given time is reduced, since only one style processor needs to be held at a time. After
testing both approaches for one iteration, I found that my server did not have enough
RAM to hold all of the CSL processors at once. This forced the per-style approach into
action, as holding every processor simultaneously was unfeasible in the current state of
my server.
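
A sketch of this per-style loop ordering is shown below. The helper names
load_style_processor and render_reference are hypothetical stand-ins for the
citeproc-java calls used in the real converter, and the sketch is written in Python for
brevity rather than against the actual Java API.

def convert_all(styles, bibtex_entries, load_style_processor, render_reference):
    """Outer loop over styles so only one CSL processor is in memory at a time."""
    for style in styles:
        processor = load_style_processor(style)   # expensive, done once per style
        for entry in bibtex_entries:
            reference = render_reference(processor, entry)
            entry.setdefault("Citation", []).append(
                {"Style": style, "bibRef": reference})
    return bibtex_entries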

The BIBTEX text files needed to be split into smaller subsets of files because, when the
conversion process begins, it reads the BIBTEX file and parses the entire file into a
BIBTEX database object in RAM. If a file was too large, memory leaks would occur,
which is why they needed to be split. This turned out to be an advantage anyway, as it
made it easier to write parallel code for completing the conversion.

4.4 Structure of the Data-set

The structure of the data-set is dictated by the fields and information needed in order to
train and evaluate existing citation tools, which currently lack extensive and diverse
training data. The BIBTEX information at the top of each entry is what is used to create
the reference strings, which are contained in the 'Citation' array, each in a different style.
'bibRef' is short for bibliographic reference and is the original, unaltered reference.
'AnnoRef' is the annotated version of the original reference, which will be used for
training a model with the currently available machine learning tools. The data-set itself
is distributed as CSV/JSON; the data was first stored in a MongoDB database.


This shows the schema of the database, with "..." indicating that more similar values
follow, and quotes indicating field values.

// One Entry in JSON format


{
// BibTeX Information
_id : " ",
Title : " ",
Author : " ",
Volume : " ",
Pages : " ",
...
// Citation/Reference Information
Citation : [
{
Style : " ",
bibRef : " ",
AnnoRef : " "
}
{
...
}
...
]
}

4.5 Meta-Data Field consistency & Noise

Clean, consistent and clear meta-data fields are necessary when making string
comparisons, both between the meta-data field names themselves and between the values
of the fields. Inaccuracies in meta-data fields can therefore impact the assessment
percentages of parsing tools. A simple example of this is if the BIBTEX entry contains
a field "Pages" and, when the entry is converted to an annotated citation, the entry for
"Pages" is given a different field name from the original, such as "Locality".
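
A minimal sketch of guarding against such mismatches with an explicit mapping from
annotation labels back to BIBTEX field names; the label names here are illustrative
assumptions.

# Map annotation labels onto canonical BibTeX field names before any string
# comparison; unknown labels are left unchanged and can be logged.
LABEL_TO_BIBTEX = {
    "locality": "pages",        # hypothetical mismatch described above
    "issued": "year",
    "container-title": "journal",
}

def normalise_label(label):
    return LABEL_TO_BIBTEX.get(label.lower(), label.lower())

assert normalise_label("Locality") == "pages"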


4.6 Results

The overall scraping resulted in 7-8 text files containing an estimated 2.5 million BIBTEX
entries in total, which, when multiplied by the number of styles I have prepared for
annotation (1,600), gives 4 billion possible instances over 2.5 million unique reference
strings. Currently the data-set contains approximately 20 million instances of 400,000
unique reference strings, each converted using distinct styles. Only 50 styles were used
in the conversion because of the way in which the data was being converted, which I
described earlier in this chapter. Since one style had to go through every BIBTEX entry
in a particular file, only 50 styles have so far been used in total for the instances of
references in the data set. Putting the order of execution the opposite way was unfeasible,
but would have used all 1,600 styles on one entry at a time. 7-8GB of RAM was being
consumed in order to have concurrent updating on different ports. Between the data
stored from BIBTEX entries, code, resources for styles/DB connectors and the data set
entries, over 70GB of disk was used, up to the point where errors appeared because even
tab completion could not execute due to lack of memory. A confusion matrix is regularly
used to depict the performance of a machine learning model. Precision is the ratio of
correctly predicted positive observations to the total predicted positive observations; a
high precision correlates with a low false positive rate. The question that precision
answers is: of the fields that we labelled, how many were labelled correctly? Recall, also
referred to as sensitivity, is the ratio of correctly predicted positive observations to all
the observations within the classification class. Recall answers the question: of the
ground truth fields, how many did the tool label? The F1 score is the weighted average
of precision and recall. This can be better than accuracy if the false positive and false
negative rates are drastically different from each other. Accuracy works best when false
negatives and false positives are relatively close in number or have a similar cost, which
is also referred to as having an 'even class distribution'.
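
In the usual notation, with TP, FP and FN counting true positives, false positives and
false negatives per field:

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]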

Table 4.1 shows the number of fields overall contained in each scraper's BIBTEX entries.
CCS and DBLP appear to have the most diverse fields, with almost no zeros in their
columns. IEEE and TeXMed appear to have very focused sets of data, which visual
inspection confirms: the majority of their BIBTEX entries are exactly 10 lines containing
exactly the same fields. The sheer size difference between CCS, DBLP and the rest of
the scrapers may be a more likely explanation for the diversity, but manual inspection
does reveal the consistency of the IEEE and TeXMed BIBTEX.

Table 4.2 shows the types of BIBTEX entries that were gathered by each scraper.

Table 4.1: Number of Fields in BIBTEX Entries for each Scraper

ACM CCS IEEE texmed DBLP


address 233790 21897 0 0 51
annote 0 21681 0 0 0
author 323633 390186 143981 173744 701802
booktitle 63273 161895 105049 0 401825
chapter 24556 186 0 0 17
crossref 0 45388 0 0 0
edition 12 537 0 0 15
editor 51950 66112 0 0 199695
howpublished 0 408 0 0 0
institution 0 5355 0 0 60
journal 233794 186669 65558 173744 294255
key 1452 2286 0 0 335
month 210522 78808 168872 128940 20
note 13447 36114 0 0 47
number 232433 155326 64793 164979 227003
organization 0 1557 0 0 175
pages 295327 281575 167902 173744 636320
publisher 353843 137330 0 0 341704
school 0 14838 0 0 4575
series 50239 30282 0 0 94151
title 393858 557415 275656 173744 1050427
type 1 17166 0 0 200
volume 232975 210201 82582 165769 387578
year 353860 388518 170607 173744 1050460

Table 4.2: Types of BIBTEX Entries for each Scraper

ACM CCS IEEE TexMed DBLP


article 233790 200694 65664 173123 294244
book 16176 9762 0 0 5310
booklet 0 21 0 0 0
conference 0 63 0 0 0
inbook 0 126 0 0 0
incollection 24556 5826 0 0 5130
inproceedings 63273 113877 105058 0 396695
manual 0 528 0 0 0
mastersthesis 190 6690 0 0 0
misc 0 36024 0 0 0
phdthesis 12473 5256 0 0 4371
proceedings 2208 3288 0 0 344676
techreport 1185 6798 0 0 0
unpublished 0 708 0 0 0


Figure 4.1 displays the count of each field within each scraper. The fields gathered in
the largest numbers are generally those most important for citation impact calculations
and for recommending other scholarly articles, although the fact that "year" is higher
than "author" leads me to believe there may be many conference entries within DBLP
which contain event details rather than authors. Everything else is as expected, with
"pages", "journal", "volume", "publisher" and "number" all having medium to high
counts.

Figure 4.2 depicts the number of each type of BIBTEX entry within each scraper. The
majority of the dataset is made up of "article", "inproceedings" and "proceedings"
entries. I believe the large number of "proceedings" entries in DBLP explains the lack of
"author" fields collected in comparison to the "year" field.


Figure 4.1: Number of Fields in BIBTEX entries, distributed over the different scrapers

Figure 4.2: Number of Types of BIBTEX entries, distributed over the scrapers

Chapter 5

Future Work & Data-Set Usability

This data-set will be used for training citation parsing tools which feature machine
learning based approaches; the tools will learn custom parsing rules from this training
data. Open source tools of this kind include CERMINE [3], GROBID [44] and ParsCit
[10].

5.1 The Process of Evaluation Testing

'Out of the box' versions of the tools contain pre-trained machine learning models which
are used if no other model is provided. The comprehensive data-set is split into a 66%
training set and a 33% evaluation set. Many of the tools require the annotated references
to be formatted in a particular way before being processed, usually simply newline
delimited. First, the evaluation portion of the set is tested with each tool to obtain
initial values for 'out of the box' performance. These results are later compared against
the results of the retrained models using confusion matrices or basic accuracy
percentages. Each tool is then trained using the training data, and the resulting
accuracies are contrasted in terms of precision, recall and F1 score.
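A minimal sketch of this split, assuming the annotated references live in a
newline-delimited file and that evaluate, retrain and parse_references are hypothetical
wrappers around whichever tool is being tested:

    import random

    def load_references(path):
        # One annotated reference string per line, as most tools expect.
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]

    def split_dataset(references, train_fraction=0.66, seed=42):
        # Shuffle, then take ~66% for training and the rest for evaluation.
        random.Random(seed).shuffle(references)
        cut = int(len(references) * train_fraction)
        return references[:cut], references[cut:]

    # Illustrative usage: score the out-of-the-box model, retrain on the
    # training portion, then score again on the same held-out portion.
    # train_set, eval_set = split_dataset(load_references("dataset.txt"))
    # baseline = evaluate(parse_references, eval_set)
    # retrain(train_set)
    # retrained = evaluate(parse_references, eval_set)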


Figure 5.1: Shows the dissection and pattern matching of a typical research article

Figure 5.1 depicts a visual representation of the tagging and pattern matching that
happens when parsing a scientific research document, with the field labels shown in the
legend on the right [45].


5.1.1 Comparison Method

For each tool, the output fields extracted for each reference are compared against the
annotated values of the same reference stored within the data-set, which are referred to
as the 'ground truth' values. The output fields are cleaned slightly first: special
characters and escape characters are removed, the text is converted to lowercase, and
some intermittent characters such as hyphens are normalised. After this cleaning, true
and false values are attached to each extracted field based on string comparisons between
the cleaned extraction fields and the ground truth values. For an extracted field to be
considered correct, both its type and its value must be equal to one of the fields in the
ground truth reference. Removing edge punctuation and variable elements of the string
such as commas and ampersands eases the comparison process; we do not particularly
care whether a style uses '&' or 'and', as that is not what is important within the
extraction process.
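A minimal sketch of this kind of normalisation, written as an assumption about what the
cleaning step might look like rather than the exact rules used:

    import re
    import unicodedata

    def clean_field(value):
        # Normalise an extracted or ground truth field value for comparison.
        value = value.lower()
        # Treat '&' and 'and' the same, and map dash variants to '-'.
        value = value.replace("&", " and ")
        value = re.sub(r"[\u2010-\u2015]", "-", value)
        # Fold accents and drop remaining non-ASCII or escape remnants.
        value = unicodedata.normalize("NFKD", value)
        value = value.encode("ascii", "ignore").decode("ascii")
        # Strip edge punctuation and collapse internal whitespace.
        value = value.strip(" .,;:!?'\"()")
        return re.sub(r"\s+", " ", value).strip()

    def field_correct(extracted_type, extracted_value, ground_truth_fields):
        # A field counts as correct when its type matches and its cleaned
        # value equals one of the cleaned ground truth values of that type.
        return any(extracted_type == t and
                   clean_field(extracted_value) == clean_field(v)
                   for t, v in ground_truth_fields)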

Figure (6), reproduced from [37], shows string comparison results relating to the length
and manner of each match, as percentages.

In figure (6), 'recognized' means that the strings were equal and a 1 was assigned.
'Superstring' means that the tool's output value contained the ground truth value but
also contained extra elements not present in the ground truth value. 'Substring' means
the tool's output value was contained in the ground truth value, but was not the
complete ground truth string.
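A minimal sketch of classifying a cleaned output value against its cleaned ground truth
value into these categories (the names are illustrative):

    def classify_match(output, truth):
        # Assumes both strings have already been cleaned/normalised.
        if output == truth:
            return "recognized"
        if truth and truth in output:
            return "superstring"   # output contains extra elements
        if output and output in truth:
            return "substring"     # output is only part of the ground truth
        return "unrecognized"

    print(classify_match("ieee trans. on knowledge and data engineering, 2014",
                         "ieee trans. on knowledge and data engineering"))
    # -> 'superstring'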


Another interesting way of measuring the accuracy of the meta-data output by the tools
is to apply a Levenshtein distance comparison together with a reasonable threshold for
deciding whether an output is correct or incorrect. The reason for calling it interesting
is that, upon inspection of the accuracy results, it can indicate or otherwise pinpoint
inaccuracies in the tools which may relate to particular styles. It can also illustrate
which sections of the reference are being mistaken, based on the accuracy percentages
given for each extracted field.
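A minimal sketch of such a threshold-based match, using a plain dynamic-programming
Levenshtein distance (the 0.8 similarity threshold is an illustrative choice, not a value
prescribed by this project):

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance between two strings.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def fuzzy_match(extracted, ground_truth, threshold=0.8):
        # Accept the field when the normalised similarity clears the threshold.
        longest = max(len(extracted), len(ground_truth)) or 1
        similarity = 1 - levenshtein(extracted, ground_truth) / longest
        return similarity >= threshold

    print(fuzzy_match("neural information processing systems",
                      "neural information processing system"))   # True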

5.2 Outlook

The scraping was cut short because of time constraints on the project; the scarcity of
available disk and memory on my server was an additional contributing factor. The server
was running out of disk due to the scale of the entries and the amount of data being
stored in text files from the BIBTEX entries. I underestimated the amount of RAM that
Mongo would use in order to make fast updates to the DB, and thus I could not run 4 of
the conversions concurrently and eventually could not even run 1 DB without memory
issues. Since there are over 2 million BIBTEX entries and at least 1600 configured styles
for conversion, the number of instances of converted reference strings comes to roughly
4 billion. In future, I would need to gather enough concurrent resources to make these
conversions faster and to complete them. Under the time constraints it was difficult to
make the conversions go any faster, given that each instance was converting, updating
and storing about 10 entries per second; even with 4 instances running concurrently on
the small server, roughly 40 conversions per second, converting the entire BIBTEX set
would still take about 1000 days.

5.3 Conclusion

Through researching countless scientific papers in this area and manoeuvring through
the struggles of scraping, I can see why open source data sets for citation parsing tools
are sparse. Regardless of this, I am positive that this data set will continue to grow
and increase performance for open source meta-data extraction tools, based on results
from [2] showing that retraining already trained tools can give promising results even
with moderately sized data sets. Since this data set will become much larger over
time, I predict that performance will grow with its size, until a threshold of reasonable
uncertainty is reached and artificial noise and human error will need to be introduced in
order to train these tools further.

Appendix A

Terms

BIBTEX - Bibliography Type Setting


DBLP - The dblp computer science bibliography is the on-line reference for open bibli-
ographic information on computer science journals and proceedings
CCS - The Collection of Computer Science Bibliographies
PubMed - PubMed is a free search engine accessing primarily the MEDLINE database
of references
TexMed - a BibTeX interface for PubMed
ACM - The Association for Computing Machinery is an international learned society for
computing
IEEE - The Institute of Electrical and Electronics Engineers is a professional association

Appendix B

Information on BibTeX

These are the standard entry types of BibTeX entries and the details of
what is required or optional for each

• article
An article from a journal or magazine. Required fields: author, title, journal, year.
Optional fields: volume, number, pages, month, note.

• book
A book with an explicit publisher. Required fields: author or editor, title, pub-
lisher, year. Optional fields: volume or number, series, address, edition, month,
note.

• booklet
A work that is printed and bound, but without a named publisher or sponsoring
institution. Required field: title. Optional fields: author, howpublished, address,
month, year, note.

• conference
The same as INPROCEEDINGS, included for Scribe compatibility.

• inbook
A part of a book, which may be a chapter (or section or whatever) and/or a range
of pages. Required fields: author or editor, title, chapter and/or pages, publisher,
year. Optional fields: volume or number, series, type, address, edition, month,
note.


• incollection
A part of a book having its own title. Required fields: author, title, booktitle,
publisher, year. Optional fields: editor, volume or number, series, type, chapter,
pages, address, edition, month, note.

• inproceedings
An article in a conference proceedings. Required fields: author, title, booktitle,
year. Optional fields: editor, volume or number, series, pages, address, month,
organization, publisher, note.

• manual
Technical documentation. Required field: title. Optional fields: author, organiza-
tion, address, edition, month, year, note.

• mastersthesis
A Master’s thesis. Required fields: author, title, school, year. Optional fields:
type, address, month, note.

• misc
Use this type when nothing else fits. Required fields: none. Optional fields: author,
title, howpublished, month, year, note.

• phdthesis
A PhD thesis. Required fields: author, title, school, year. Optional fields: type,
address, month, note.

• proceedings
The proceedings of a conference. Required fields: title, year. Optional fields:
editor, volume or number, series, address, month, organization, publisher, note.

• techreport
A report published by a school or other institution, usually numbered within a
series. Required fields: author, title, institution, year. Optional fields: type,
number, address, month, note.

• unpublished
A document having an author and title, but not formally published. Required
fields: author, title, note. Optional fields: month, year.


[46]

• address
Usually the address of the publisher or other type of institution. For major publish-
ing houses, van Leunen recommends omitting the information entirely. For small
publishers, on the other hand, you can help the reader by giving the complete
address.

• annote
An annotation. It is not used by the standard bibliography styles, but may be
used by others that produce an annotated bibliography.

• author
The name(s) of the author(s), in the format described in the LATEX book.

• booktitle
Title of a book, part of which is being cited. See the LATEX book for how to type
titles. For book entries, use the title field instead.

• chapter
A chapter (or section or whatever) number.

• crossref
The database key of the entry being cross referenced.

• edition
The edition of a book–for example, “Second”. This should be an ordinal, and
should have the first letter capitalized, as shown here; the standard styles convert
to lower case when necessary.

• editor
Name(s) of editor(s), typed as indicated in the LATEX book. If there is also an
author field, then the editor field gives the editor of the book or collection in which
the reference appears.

• howpublished
How something strange has been published. The first word should be capitalized.

• institution
The sponsoring institution of a technical report.

• journal
A journal name. Abbreviations are provided for many journals; see the Local
Guide.

• key
Used for alphabetizing, cross referencing, and creating a label when the “author”
information (described in Section 4) is missing. This field should not be confused
with the key that appears in the cite command and at the beginning of the database
entry.

• month
The month in which the work was published or, for an unpublished work, in which
it was written. You should use the standard three-letter abbreviation, as described
in Appendix B.1.3 of the LATEX book.

• note
Any additional information that can help the reader. The first word should be
capitalized.

• number
The number of a journal, magazine, technical report, or of a work in a series. An
issue of a journal or magazine is usually identified by its volume and number; the
organization that issues a technical report usually gives it a number; and sometimes
books are given numbers in a named series.

• organization
The organization that sponsors a conference or that publishes a manual.

• pages
One or more page numbers or range of numbers, such as 42-111 or 7,41,73-97 or
43+ (the '+' in this last example indicates pages following that don't form a simple
range). To make it easier to maintain Scribe-compatible databases, the standard
styles convert a single dash (as in 7-33) to the double dash used in TEX to denote
number ranges (as in 7--33).


• publisher
The publisher’s name.

• school
The name of the school where a thesis was written.

• series
The name of a series or set of books. When citing an entire book, the the title field
gives its title and an optional series field gives the name of a series or multi-volume
set in which the book is published.

• title
The work’s title, typed as explained in the LATEX book.

• type
The type of a technical report–for example, “Research Note”.

• volume
The volume of a journal or multivolume book.

• year
The year of publication or, for an unpublished work, the year it was written.
Generally it should consist of four numerals, such as 1984, although the standard
styles can handle any year whose last four nonpunctuation characters are numerals,
such as ‘(about 1984)’.

[46]
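Drawing these together, a minimal @article entry using the required fields plus a few of
the optional ones described above might look as follows (the key and all values are
invented purely for illustration):

    @article{smith2017example,
      author  = {Jane Smith and John Doe},
      title   = {An Example Article Title},
      journal = {Journal of Illustrative Examples},
      year    = {2017},
      volume  = {12},
      number  = {4},
      pages   = {305--318},
      month   = apr
    }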

Bibliography

[1] Joeran Beel, Bela Gipp, Stefan Langer, and Corinna Breitinger. Research paper rec-
ommender systems: A literature survey. International Journal on Digital Libraries,
17(4):305–338, 2016. doi: 10.1007/s00799-015-0156-0.

[2] Dominika Tkaczyk, Andrew Collins, Paraic Sheridan, and Joeran Beel. Machine
learning vs. rules and out-of-the-box vs. retrained: An evaluation of open-source
bibliographic reference and citation parsers. In Proceedings of the ACM/IEEE Joint
Conference on Digital Libraries (JCDL), 2018.

[3] Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek, and
Lukasz Bolikowski. CERMINE: automatic extraction of structured metadata from
scientific literature. IJDAR, 18(4):317–335, 2015. doi: 10.1007/s10032-015-0249-8.
URL https://doi.org/10.1007/s10032-015-0249-8.

[4] Dominika Tkaczyk, Pawel Szostek, Piotr Jan Dendek, Mateusz Fedoryszak, and
Lukasz Bolikowski. CERMINE - automatic extraction of metadata and references
from scientific literature. In Jean-Yves Ramel, Marcus Liwicki, Jean-Marc Ogier,
Koichi Kise, and Ray Smith, editors, 11th IAPR International Workshop on Docu-
ment Analysis Systems, DAS 2014, Tours, France, April 7-10, 2014, pages 217–221.
IEEE Computer Society, 2014. ISBN 978-1-4799-3244-3. doi: 10.1109/DAS.2014.63.
URL https://doi.org/10.1109/DAS.2014.63.

[5] Dominika Tkaczyk and Lukasz Bolikowski. Extracting contextual information from
scientific literature using CERMINE system. In Fabien Gandon, Elena Cabrio,
Milan Stankovic, and Antoine Zimmermann, editors, Semantic Web Evaluation
Challenges - Second SemWebEval Challenge at ESWC 2015, Portorož, Slovenia,
May 31 - June 4, 2015, Revised Selected Papers, volume 548 of Communications


in Computer and Information Science, pages 93–104. Springer, 2015. ISBN 978-3-
319-25517-0. doi: 10.1007/978-3-319-25518-7 8. URL https://doi.org/10.1007/
978-3-319-25518-7_8.

[6] Dominika Tkaczyk, Pawel Szostek, and Lukasz Bolikowski. GROTOAP2 - the method-
ology of creating a large ground truth dataset of scientific articles. D-Lib Magazine,
20(11/12), 2014.

[7] C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. Citeseer: an automatic
citation indexing system. In INTERNATIONAL CONFERENCE ON DIGITAL
LIBRARIES, pages 89–98. ACM Press, 1998.

[8] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore.
Automating the construction of internet portals with machine learning. Information
Retrieval, 3(2):127–163, 2000.

[9] Pubmed. URL http://www.ncbi.nlm.nih.gov/pubmed.

[10] Isaac G Councill, C Lee Giles, and Min-Yen Kan. Parscit: an open-source crf
reference string parsing package. In LREC, volume 8, pages 661–667, 2008.

[11] Mateusz Fedoryszak, Dominika Tkaczyk, and Lukasz Bolikowski. Large scale cita-
tion matching using apache hadoop. In Trond Aalberg, Christos Papatheodorou,
Milena Dobreva, Giannis Tsakonas, and Charles J. Farrugia, editors, Research and
Advanced Technology for Digital Libraries - International Conference on Theory
and Practice of Digital Libraries, TPDL 2013, Valletta, Malta, September 22-26,
2013. Proceedings, volume 8092 of Lecture Notes in Computer Science, pages 362–
365. Springer, 2013. ISBN 978-3-642-40500-6. doi: 10.1007/978-3-642-40501-3 37.
URL https://doi.org/10.1007/978-3-642-40501-3_37.

[12] Mateusz Fedoryszak, Lukasz Bolikowski, Dominika Tkaczyk, and Krzysztof Woj-
ciechowski. Methodology for evaluating citation parsing and matching. In Robert
Bembenik, Lukasz Skonieczny, Henryk Rybinski, Marzena Kryszkiewicz, and Marek
Niezgodka, editors, Intelligent Tools for Building a Scientific Information Platform
- Advanced Architectures and Solutions, volume 467 of Studies in Computational
Intelligence, pages 145–154. Springer, 2013. ISBN 978-3-642-35646-9. doi: 10.1007/
978-3-642-35647-6 11. URL https://doi.org/10.1007/978-3-642-35647-6_11.


[13] Joeran Beel, Bela Gipp, Stefan Langer, Marcel Genzmehr, Erik Wilde, Andreas
Nürnberger, and Jim Pitman. Introducing Mr. DLib, a Machine-readable Digital Li-
brary. In Proceedings of the 11th ACM/IEEE Joint Conference on Digital Libraries
(JCDL'11), pages 463–464. ACM, 2011. doi: 10.1145/1998076.1998187.

[14] Joeran Beel, Akiko Aizawa, Corinna Breitinger, and Bela Gipp. Mr. dlib:
Recommendations-as-a-service (raas) for academia. In Proceedings of the
ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), pages 1–2, 2017.
doi: 10.1109/JCDL.2017.7991606.

[15] Xiaoli Zhang, Jie Zou, Daniel X Le, and George R Thoma. A structural svm
approach for reference parsing. BMC bioinformatics, 12(3):S7, 2011.

[16] Jian Wu, Kyle Williams, Hung-Hsuan Chen, Madian Khabsa, Cornelia Caragea,
Alexander Ororbia, Douglas Jordan, and C. Giles. Citeseerx: Ai in a digital library
search engine, 2014. URL https://www.aaai.org/ocs/index.php/IAAI/IAAI14/
paper/view/8607.

[17] Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. Eval-
uation of header metadata extraction approaches and tools for scientific pdf doc-
uments. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital
Libraries, JCDL ’13, pages 385–386, New York, NY, USA, 2013. ACM. ISBN
978-1-4503-2077-1. doi: 10.1145/2467696.2467753. URL http://doi.acm.org/
10.1145/2467696.2467753.

[18] Jiangde Yu and Xiaozhong Fan. Metadata extraction from chinese research papers
based on conditional random fields. In Fourth International Conference on Fuzzy
Systems and Knowledge Discovery (FSKD 2007). IEEE, 2007. doi: 10.1109/fskd.
2007.394. URL https://doi.org/10.1109/fskd.2007.394.

[19] Qing Zhang, Yong-Gang Cao, and Hong Yu. Parsing citations in biomedical articles
using conditional random fields. Computers in Biology and Medicine, 41(4):190–
194, apr 2011. doi: 10.1016/j.compbiomed.2011.02.005. URL https://doi.org/
10.1016/j.compbiomed.2011.02.005.

[20] Sein Lin, Jun-Ping Ng, Shreyasee Pradhan, Jatin Shah, Ricardo Pietrobon, and
Min-Yen Kan. Extracting formulaic and free text clinical research articles metadata
using conditional random fields. In Proceedings of the NAACL HLT 2010 Second

Louhi Workshop on Text and Data Mining of Health Documents, Louhi ’10, pages
90–95, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
URL http://dl.acm.org/citation.cfm?id=1867735.1867749.

[21] David Pinto, Andrew McCallum, Xing Wei, and W Bruce Croft. Table extraction
using conditional random fields. In Proceedings of the 26th annual international
ACM SIGIR conference on Research and development in informaion retrieval, pages
235–242. ACM, 2003.

[22] F. Peng and A. McCallum. Accurate information extraction from research papers
using conditional random fields. Retrieved on April 13, 2013.

[23] Tudor Groza, Gunnar AAstrand Grimnes, and Siegfried Handschuh. Reference
information extraction and processing using conditional random fields. Information
Technology and Libraries (Online), 31(2):6, 2012.

[24] Michel Krämer. citeproc-java: A citation style language (csl) processor for java.
URL http://michel-kraemer.github.io/citeproc-java/.

[25] Liangcai Gao, Xixi Qi, Zhi Tang, Xiaofan Lin, and Ying Liu. Web-based citation
parsing, correction and augmentation. In Proceedings of the 12th ACM/IEEE-CS
joint conference on Digital Libraries, pages 295–304. ACM, 2012.

[26] Robert D Cameron. A universal citation database as a catalyst for reform in schol-
arly communication. First Monday, 2(4), 1997.

[27] Steve Lawrence, C Lee Giles, and Kurt Bollacker. Digital libraries and autonomous
citation indexing. Computer, 32(6):67–71, 1999.

[28] Eli Cortez, Altigran S da Silva, Marcos André Gonçalves, Filipe Mesquita, and
Edleno S de Moura. Flux-cim: flexible unsupervised extraction of citation metadata.
In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, pages
215–224. ACM, 2007.

[29] Michael Jewell. Paracite: An overview, 2000.

[30] Ping Yin, Ming Zhang, ZhiHong Deng, and DongQing Yang. Metadata extrac-
tion from bibliographies using bigram hmm. In International Conference on Asian
Digital Libraries, pages 310–319. Springer, 2004.


[31] Erik Hetzner. A simple method for citation metadata extraction using hidden
markov models. In Proceedings of the 8th ACM/IEEE-CS joint conference on Dig-
ital libraries, pages 280–284. ACM, 2008.

[32] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random
fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings
of the 18th International Conference on Machine Learning (ICML), 2001.

[33] Elyssa Bradham. Hidden Markov Model and some applications in handwriting
recognition. SlidePlayer presentation. URL http://slideplayer.com/slide/1416164/4/
images/17/HiddenMarkovModelStartRainySunnyWalkShopCleanExample(cont):.jpg.

[34] Patrice Lopez and Laurent Romary. GROBID - information extraction from
scientific publications. ERCIM News, 2015(100), 2015. URL http://ercim-news.
ercim.eu/en100/r-i/grobid-information-extraction-from-scientific-publications.

[35] Nguyen Viet Cuong, Muthu Kumar Chandrasekaran, Min-Yen Kan, and Wee Sun
Lee. Scholarly document information extraction using extensible features for effi-
cient higher order semi-crfs. In Paul Logasa Bogen II, Suzie Allard, Holly Mer-
cer, Micah Beck, Sally Jo Cunningham, Dion Hoe-Lian Goh, and Geneva Henry,
editors, Proceedings of the 15th ACM/IEEE-CE Joint Conference on Digital Li-
braries, Knoxville, TN, USA, June 21-25, 2015, pages 61–64. ACM, 2015. ISBN
978-1-4503-3594-2. doi: 10.1145/2756406.2756946. URL http://doi.acm.org/10.
1145/2756406.2756946.

[36] Dominika Tkaczyk, Bartosz Tarnawski, and Lukasz Bolikowski. Structured affilia-
tions extraction from scientific literature. D-Lib Magazine, 21(11/12), 2015. doi:
10.1045/november2015-tkaczyk. URL https://doi.org/10.1045/november2015-tkaczyk.

[37] Dominika Tkaczyk, Pawel Szostek, Piotr Jan Dendek, Mateusz Fedoryszak, and
Lukasz Bolikowski. Cermine–automatic extraction of metadata and references from

scientific literature. In Document Analysis Systems (DAS), 2014 11th IAPR Inter-
national Workshop on, pages 217–221. IEEE, 2014.

[38] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness
of data. IEEE Intelligent Systems, 24(2):8–12, 2009.

[39] Kristie Seymore, Andrew McCallum, and Roni Rosenfeld. Learning hidden markov
model structure for information extraction. In AAAI-99 workshop on machine
learning for information extraction, pages 37–42, 1999.

[40] Joeran Beel and Bela Gipp. On the robustness of google scholar against spam. In
Proceedings of the 21st ACM Conference on Hypertext and Hypermedia (HT’10),
pages 297–298, Toronto (CA), June 2010. ACM. doi: 10.1145/1810617.1810683.

[41] Joeran Beel and Bela Gipp. Academic search engine spam and google scholar's
resilience against it. Journal of Electronic Publishing, 13(3), December 2010. doi:
10.3998/3336451.0013.305.

[42] How to: Scrape search engines without pissing them off, Feb
2017. URL https://searchnewscentral.com/blog/2011/09/28/
how-to-scrape-search-engines-without-pissing-them-off/.

[43] Citation style language. URL http://citationstyles.org/.

[44] Patrice Lopez. Grobid: Combining automatic bibliographic data recognition and
term extraction for scholarship publications. In International Conference on Theory
and Practice of Digital Libraries, pages 473–474. Springer, 2009.

[45] Hui Wei, Shaopeng Wu, Youbing Zhao, Zhikun Deng, Nikolaos Ersotelos, Farzad
Parvinzamir, Baoquan Liu, Enjie Liu, and Feng Dong. Data mining, management
and visualization in large scientific corpuses, 04 2016.

[46] Bibtex entry types, field types and usage hints. URL https://www.openoffice.
org/bibliographic/bibtex-defs.html.
