Domain-Specific Information Extraction Structures
Seamus Lyons, Dan Smith
School of Information Systems, University of East Anglia, Norwich UK
{S.Lyons, Dan.Smith}@uea.ac.uk
ABSTRACT
Information describing an event can frequently be found on several Web pages, each of
which is often poorly structured and incomplete; the set of pages is typically repetitive,
often contradictory and verbose. In order to deliver high quality information to a variety
of devices in different contexts and roles, we need to provide information that is more
succinct. Our approach to this problem is to use domain-specific templates to extract
information selectively from the original pages into XML frames, which act as a
canonical structure. The extracted information can then be transformed into a form
suitable for the intended application. A further series of output transformations is then
applied to format the information appropriately, e.g. as speech, simple text or web pages.
We illustrate our approach with an application in reporting soccer matches.
1. Introduction
In this paper, we describe a mediation system
prototype that extracts information from
heterogeneous sources using libraries of domain-
specific template rules. The system allows for the
re-use of rules to aid generality whilst supporting
specialised functionality for domain-specific
solutions. The mediated data is represented in an
XML format. The system framework is
extensible to allow distribution to various media.
Presently, mediation systems are not well adapted
to serving the need for information customisation
[20]. For example, a user searching for stock
quotes, a soccer match score or traffic details
does not want to have to surf the Internet for it.
Advances in technology, exemplified by devices
such as third-generation telephones and PDAs,
create new difficulties in providing information
appropriate to the user's context (location, role,
connectivity, etc.).
Current web sites typically contain partially
relevant data that is displayed in a variety of
formats. Information has to be abstracted from
multiple heterogeneous sources and transformed
into a common format that can form the basis of
context-sensitive information delivery systems.
The widespread use of XML may alleviate some
of these problems, but cannot address many of
the problems in information extraction and
integration. We introduce an information
mediation system that extracts, translates and
distributes domain-specific information reliably
to a variety of media from web sources.
Information Extraction (IE) systems have focused
on the extraction of information for specific
purposes. This entails either formatting the data
so that traditional database techniques can be
applied, or more sophisticated linguistic analysis
to extract the required information. These
approaches do not address several areas. First,
information that is held within a web document is
often poorly constructed, incomplete and
contradictory. Secondly, the same information is
required in a number of ways dependent on the
user's requirements. Finally, the problem of
information overload is compounded by the
rapidly changing technical environment and
potential applications. A mediation system
therefore needs to integrate data of different
formats from multiple heterogeneous sources,
transform the data to an appropriate format and
be extensible to meet future needs.
The remainder of the paper is organised as
follows: section 2 describes the various systems
and approaches involved in the mediation process
in regard to related work. This is followed by a
description of the domain-specific template rules
in section 3. Section 4 describes the functionality
of the system architecture using a soccer
match report example. The results are described
in section 5. This is followed by our conclusions
and plans for future work.
2. Related Work
The extraction of information from semi-
structured documents has attracted substantial
attention, focusing on one of two areas
of the mediation process: IE or Information
Integration. IE solutions have attracted the
attention of the database community where the
approach has led to the creation of wrappers that
use the metadata within the web document to
access the information. This approach is
successful in retrieving the underlying structure
of documents where achievable. The construction
of a wrapper for each source document is time-
consuming and difficult. Much research has
applied machine learning to overcome this
problem through wrapper generation [9] or wrapper
induction [4, 11].
Structure is not always present in a web
document: many documents contain bodies of free
text with no relevant metadata. To
extract the pertinent information it is necessary to
gain an understanding of the text itself. These
systems include Autoslog [16], CRYSTAL [17],
RAPIER [2], WHISK [18], and STALKER [12].
These systems have the overhead of complex
processing that degrades the system efficiency at
extraction time. The essence of the problem is the
need for domain-specific knowledge to extract
the relevant information. Various systems, such
as CRYSTAL, rely on a tagged corpus to learn
the extraction template rules. However, the
dynamic nature of the web makes wrapper
maintenance expensive; increasing precision for
each domain results in a loss of generality, or in
the duplication of rules that are similar across
domains.
The extraction of semi-structured data into a
structured format allows information to be
stored such that traditional database techniques
can be applied. The querying of web data using
traditional data manipulation techniques assumes
a common schema between documents
[1, 3, 7, 10, 13, 14]. This commonality between data
objects can then be used to integrate values that
are labelled as the same information. Duplication
issues are addressed in [19]. Finally, more
specific approaches to data integration include the
use of a context mediator [1,5], compact
skeletons [15] or the use of XML to mediate and
integrate data [8].
3. Domain-specific extraction
We assume that, at least for mediation and
extraction systems, knowledge is specific to one
or more domains and that domains are related in a
fuzzy hierarchy. These notions are implemented
through reusable libraries of rules and templates,
coupled to a hierarchy of extraction frames and
concept elements. This extensible approach
allows us to avoid the issues around the
incorporation of general knowledge into the
extraction system. The domain specific
knowledge we are interested in concerns
technical or specialised meanings, shorthand and
conventional phrasing to describe certain events
or series of events, idiomatic usage, etc. This
knowledge is essential for high-precision
information extraction.
The source-specific knowledge we require is
common to almost all wrapper systems, and
consists of information on query mechanisms,
navigation and updating policies. Updating
policies are important for our exemplar, as new
material is added as new documents, rather than
by replacing existing documents. Typically, the
information at each source is partial, so several
sources are required to provide all the
information we require. Sources vary in their
reliability for different aspects of the information
related to a domain and in the ease with which
information can be extracted.
The extraction rules are used to recognise concept
instances. A document may contain a number of
regions containing concepts of interest [21] which
are associated with sets of extraction rules to
identify and extract the information into
evaluation frames. Extraction rules are specific to
either a source or a domain, but are usually
combined in the initial extraction phase. The
initial extraction rules are based on regular
expressions, which can be specified through
several mechanisms. They incorporate a
probabilistic matching mechanism, which makes
them insensitive to many variations in wording,
layout and formatting. The coverage and
precision of the extraction is determined by
domain-specific terms. These terms are used to
determine the relevance of each text segment. If
required, a list of synonyms for each term is
defined through thesaurus rule construction. Column
headings or locating coordinates are specified for
parsing tabular data. The secondary extraction
step is to apply more computationally intensive
linguistic processing to the extracted information;
this is outside the scope of this paper.
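The first, regex-based extraction phase can be sketched as follows. This is an illustrative sketch only: the rule structure, the relevance score (fraction of domain terms present) and its threshold are assumptions, not the system's actual implementation.

```python
import re

# A toy extraction rule: a pre-/post-pattern pair delimits a region,
# and domain-specific terms score the region's relevance.
class ExtractionRule:
    def __init__(self, pre, post, terms, threshold=0.5):
        self.pre = re.compile(pre, re.I)
        self.post = re.compile(post, re.I)
        self.terms = [t.lower() for t in terms]
        self.threshold = threshold

    def extract(self, text):
        m1 = self.pre.search(text)
        if not m1:
            return None
        m2 = self.post.search(text, m1.end())
        segment = text[m1.end():m2.start() if m2 else len(text)]
        # crude relevance: fraction of the domain terms found in the segment
        score = sum(t in segment.lower() for t in self.terms) / len(self.terms)
        return segment.strip() if score >= self.threshold else None

rule = ExtractionRule(r"Teams:", r"Att:|Ref:",
                      terms=["booked", "goals", "subs"])
doc = "Teams: Norwich Marshall, Sutch. Booked: Sutch. Goals: Forbes 32. Att: 16,695"
print(rule.extract(doc))
```

Because the match is scored rather than exact, a rule of this shape tolerates variations in wording and layout between sources.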
The next phase in the process is to apply
transformation rules to normalise terminology,
units, etc. These rules are typically simple lookup
substitutions or transformations. The construction
of these rules is dependent on three properties:
the data type, the text format and the filtering
needed for the source text. For each source, a
common format for a concept is identified; this
format, together with the required output format,
is defined for the concept.
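A transformation rule of the simple-lookup kind can be sketched as below. The synonym table mirrors the competition names shown in Figure 6; the rule format itself is an assumption for illustration.

```python
# Lookup-based normalisation of a slot value: map each source-specific
# term to its canonical form, leaving unknown values unchanged.
COMPETITION_SYNONYMS = {
    "Nationwide": "League",
    "D1": "League",
    "Div 1": "League",
}

def normalise(slot_value, table):
    # trim whitespace, then return the canonical form if one is known
    value = slot_value.strip()
    return table.get(value, value)

print(normalise(" Nationwide ", COMPETITION_SYNONYMS))  # League
```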
The third phase is to eliminate duplication arising
from using multiple sources in the extraction
phase. Often, the correct cardinality or range of
values for an element is known, in which case
out of range and duplicate values can be simply
discarded. In the case of conflicting information
the most trustworthy source is preferred. For
unconstrained items the problem is more difficult.
The naïve solution is to accept all information
from the most trusted source, but this is
inadequate for most cases, where coverage is
incomplete. Currently we are experimenting with
an approach based on sentence similarity.
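The duplicate- and conflict-elimination step for constrained items might look like the following sketch. The numeric trust ordering over sources is an illustrative assumption.

```python
# Third phase sketch: discard out-of-range and duplicate values, then
# resolve any remaining conflict by preferring the most trusted source.
TRUST = {"source#1": 3, "source#2": 2, "source#3": 1}

def resolve(candidates, valid=None):
    """candidates: list of (source, value) pairs for one slot."""
    seen, kept = set(), []
    for source, value in candidates:
        if valid is not None and value not in valid:
            continue                      # out-of-range value: discard
        if value in seen:
            continue                      # duplicate value: discard
        seen.add(value)
        kept.append((source, value))
    if not kept:
        return None
    # conflicting values remain: prefer the most trusted source
    return max(kept, key=lambda sv: TRUST.get(sv[0], 0))[1]

cards = [("source#1", 2), ("source#2", 2), ("source#3", 99)]
print(resolve(cards, valid=range(0, 12)))  # 2
```

For unconstrained free-text items, as noted above, a simple policy like this is inadequate and something like sentence similarity is needed instead.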
The canonical form obtained after these three
phases of processing is the starting point
for the processes involved in transforming the
information to provide an end-user (human or
application) with information appropriate to their
current situation, or context. The distribution
requirements of the mediated data to the end-user
are defined in libraries of output transformation
rules. These rules provide selection,
summarisation and translation to another format
or medium (e.g. speech). Figure 1 shows the
architecture of the mediation system.
4. Example: soccer reporting
In this section, we describe our prototype's
functionality through the example of a soccer
match report (Figure 2). Each match report
contains various levels of information required
for a fantasy football game. Initially we need to
extract details of the result and the participating
players. Secondly, we need to uniquely identify
each match report. This is achieved by extracting
match details such as the date, competition and
both teams involved. Finally, general details
about players in the game are located in the
match summary. These details include news of
injuries, cautions, etc. This area of the match
report is a body of free text and requires a deeper
analysis.
Statistical data is often published in tabular form.
The extraction engine parses this data using a
specialised set of rules.

[Figure 1. System architecture: World Wide Web sources feed an
Extraction Engine (Token Identifier, Pattern Matcher, Table
Parser) driven by a Template Database; a Value Transformer
(Format Translator, Data Normalization, Filter Component) and an
Output Transformer (Output Generator, Text-to-Speech) produce
the output XML Docs.]

[Figure 2. Two match reports: the same "Yellow cards" concept
instance in two source formats. Source A: "D.Sutch, 19, foul;
M.Hughes, 4, foul; J.McAteer, 38, foul; M.Hughes, 45, second
bookable offence". Source B: "Booked: Sutch." and "Booked:
Hughes, McAteer."]

These work well for well-structured tables, but perform poorly
when tables lack clear structure; in such cases
the team names are used to infer home and away
players. In this instance, table co-ordinates are
used to determine which table the column is
contained within and the precise location of the
column.
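The heading-or-coordinate strategy described above can be sketched as a toy table parser; the row layout and the fallback column index are assumptions for illustration.

```python
# Locate a column by its heading when one is present; otherwise fall
# back to a fixed co-ordinate (column index) from the source template.
def parse_column(rows, heading, fallback_index):
    header, *body = rows
    idx = header.index(heading) if heading in header else fallback_index
    return [row[idx] for row in body]

rows = [["Player", "Mins", "Goals"],
        ["Forbes", "90", "1"],
        ["Bent",   "90", "1"]]
print(parse_column(rows, "Goals", 2))   # ['1', '1']
```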
Player details { Pre-text:  ( m/Teams:/i ),
                            ( m/\<B\>Teams\<\/B\>/i ),
                 Post-text: ( m/Att:|Ref:/i ) }

Figure 3. Player details extraction
Regular expression notation is used to describe
the format of the text of interest. Figure 3 shows
the identification pattern used to locate the
player details extraction frame. When the
extraction engine locates the pre- and post-
identification patterns the text is parsed into the
player details extraction frame. A frame may have
several identification patterns to identify the
relevant text. Figure 4 is an example of the
player details extraction frame. This results in
the list of players appearing, players not used,
yellow and red carded players, and goal scorers.
<P><B>Teams:</B>
<P><B>Norwich</B> Andy Marshall, Sutch,...
(Mulryne 67), Llewellyn.
<P>Subs Not Used: Green, Walsh.
<P>Booked: Sutch.
<P>Goals: Forbes 32.
<P><B>Blackburn</B> Filan, Curtis, McAteer (Dailly 80),
...
<P>Subs Not Used: Short, Kelly.
<P>Sent Off: Hughes (45).
<P>Booked: Hughes, McAteer.
<P>Goals: Bent 64.
<P>Att: 16,695

Figure 4. Extraction frame example
The extraction engine uses the identification
patterns to locate relevant information from
unstructured text. Initially the concept node
definition includes one or more patterns to
identify segments of the free text. Segment
boundaries are enforced with the use of pre- and
post-delimiters. For example, in Figure 5 the text
is segmented with <P> tags. Multiple definitions
of segment boundaries are used for system
robustness and flexibility in segment size. Term
identification is then used to identify relevant
segments from irrelevant segments of text.
Domain-specific terms or keywords are
associated with the concept node definition to
prevent erroneous extraction from incorrect
areas of the document.
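Segmentation and term identification as described above can be sketched as follows. The &lt;P&gt;-based splitting follows the match summary format in Figure 5; the injury terms are illustrative.

```python
import re

# Split the free text on <P> segment boundaries, then keep only the
# segments containing domain-specific terms for the target concept.
INJURY_TERMS = {"injury", "injured", "hamstring", "stretchered"}

def relevant_segments(html, terms):
    segments = re.split(r"</?P>", html, flags=re.I)
    return [s.strip() for s in segments
            if s.strip() and any(t in s.lower() for t in terms)]

text = ("<P>Rovers looked the better side.</P>"
        "<P>Hughes limped off with a hamstring injury.</P>")
print(relevant_segments(text, INJURY_TERMS))
```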
A player may change clubs, thus affecting his
ability to score points in the example game. The
transfer fee is one method of identifying that a
player has transferred teams. In Figure 5, the
phrases "forked out" and "£2million" signify a
player's transfer. To resolve such variation, the
domain thesaurus is used to map synonymous
terms and phrases into their appropriate
canonical form.
<P>Referee Paul Danson had no hesitation in producing the
second yellow card and Rovers looked to have a mountain to
climb.</P>
<P>But to their credit they came out in the second-half
looking the better side and deservedly drew level on 63
minutes as the impressive Bent showed why Graeme Souness
forked out £2million for his services.</P>
<P>The big striker picked up McAteer's ball in midfield and
used his strength to outmuscle two defenders before beating
Marshall with a low shot into the far corner.</P>
Figure 5. Fragment of match summary frame
The export schema defines the data types for each
concept slot. For example, the extracted text for a
team score is defined as an integer. Although this
data is generally in the correct format, it is
necessary to have a validation stage in which the
data type is checked. In Figure 6, the date for a
match played on 10th January 2001 is extracted
from source 1 and from source 2. To determine
the representation required for the production
schema the system calculates the day, month and
year for each source and normalises the data to
the required format. The mediated data conforms
to the output schema data model.
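The date normalisation shown in Figure 6 can be sketched as below. The per-source format strings are inferred from the figure, and the d/m/yyyy output format is an assumption about the production schema.

```python
from datetime import datetime

# Each source's date format is part of the source-specific knowledge;
# the formats below mirror the three sources in Figure 6.
SOURCE_FORMATS = {
    "source#1": "%y/%m/%d",        # 01/01/10
    "source#2": "%d %B %Y",        # 10 January 2001
    "source#3": "%d-%b-%y",        # 10-Jan-01
}

def normalise_date(source, raw):
    # parse with the source's format, validating the value as a date,
    # then emit the output schema's d/m/yyyy representation
    d = datetime.strptime(raw, SOURCE_FORMATS[source])
    return f"{d.day}/{d.month}/{d.year}"

for src, raw in [("source#1", "01/01/10"),
                 ("source#2", "10 January 2001"),
                 ("source#3", "10-Jan-01")]:
    print(normalise_date(src, raw))
```

A failed parse raises an exception, which doubles as the validation stage mentioned above: a value that does not match its source's declared type is flagged rather than silently passed through.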
The output transformations are defined as XSL
programs. One of our prime targets for output
transformation is to repackage summary
information for spoken delivery, using the
Festival TTS system [23].
Pre-transformation:
Match details
{source#1, [ date: 01/01/10, comp: Nationwide ]}
{source#2, [ date: 10 January 2001, comp: D1 ]}
{source#3, [ date: 10-Jan-01 ]}
Post-transformation:
Match details
{source#1, [ date: 10/1/2001, comp: League ]}
{source#2, [ date: 10/1/2001, comp: League ]}
{source#3, [ date: 10/1/2001 ]}
Figure 6. Match details before and after normalisation
5. Experimental Results
A corpus was collected from three sources
containing accounts of a group of soccer matches
played in January 2001. We performed two series
of experiments to measure the robustness and
effectiveness of the extraction process. The
purpose of the first series of experiments was to
measure the extraction performance on the
structured elements of the match reports, shown
in Figure 7. The results show that almost all this
data was correctly extracted.
Recall (%)
Source            #1      #2      #3    Total
match details    100.0    96.0   99.3    98.5
player details    97.0    99.1   98.6    98.0
match summary     85.3    84.4     -     85.1
Total             96.8    98.4   99.0    97.8

Precision (%)
Source            #1      #2      #3    Total
match details     98.6   100.0   99.9    99.6
player details    99.3   100.0   82.8    99.7
match summary     85.3    80.6     -     84.1
Total             98.6    99.6   90.3    97.5

Figure 7. The extraction results
The second set of experiments was performed on
the match summary text, which consists of short
stylised paragraphs. Here the results are
dependent on the definition of the domain-
specific terms. We concentrated on extracting
injury and transfer information, with good results.
Observation of the data suggests that with a larger
sample, source #1 would give better results, as its
match summaries are more comprehensive.
The differences in coverage from one source to
another result in integration problems. Initially
we assumed that formatting the critical data into
one common representation would be sufficient for
our purposes. In particular, we are aware that
differences in the use of names (e.g. players
referred to by description, position, nickname,
etc.) can substantially affect the extraction
performance. There are a number of issues in this
area that we have not yet addressed.
6. Conclusion
Although the development of our extensible
extraction system is at a comparatively early
stage, we believe it offers a good framework for
high precision context-aware information
delivery. The reusable rules libraries can be
coupled to extraction rule induction tools (e.g.
[22]). The core information extraction process
performs well for many domains and, although
we intend to enhance it with better linguistic
processing, the benefits of both the re-use of
general rules and the application of domain
knowledge are shown at this early stage of the
system implementation. The present schema
allows for reliable extraction across different text-
structure areas of a web document to produce a
unified domain representation. Future work will
focus on speech and other context-sensitive
information delivery services.
References
[1] S. Bressan, C-H. Goh, Answering Queries in
Context. FQAS 1998: 68-82
[2] M. E. Califf and R. J. Mooney. Relational Learning
of Pattern-Match Rules for Information Extraction,
Proc. AAAI-1999, 1999
[3] D. Florescu, A. Levy, A. Mendelson, Database
Techniques for the World Wide Web, ACM SIGMOD
Record, 27(3): 59-74 1998
[4] D. Freitag, N. Kushmerick. Boosted wrapper
induction. Proc. AAAI-2000, 577-583, 2000.
[5] C-H. Goh, S. Bressan, S. E. Madnick, M. D. Siegel,
Context Interchange: New Features and Formalisms for
the Intelligent Integration of Information. ACM TOIS
17(3), 270-293 1999
[7] U. Kruschwitz, A. De Roeck, P. Scott, S. Steel, R.
Turner, N. Webb, Extracting Semistructured Data -
Lessons Learnt, Proc. NLP2000, Patras, Greece, 2000
[8] T. Lee, M. Chams, R. Nado, S. Madnick, M. Siegel,
Information Integration with Attribution Support for
Corporate Profiles, CIKM 1999: 423-429, 1999
[9] L. Liu, C. Pu, and W. Han. XWrap: An XML-
enabled Wrapper Construction System for Web
Information Sources. Proc. IEEE ICDE, 2000
[10] David Milward, James Thomas, From Information
Retrieval to Information Extraction, ACL2000, 2000
[11] I. Muslea, S. Minton, C. Knoblock: Hierarchical
Wrapper Induction for Semistructured Information
Sources. Autonomous Agents and Multi-Agent Systems
4(1/2): 93-114 2001
[12] I. Muslea, S. Minton, and C. Knoblock.
STALKER: Learning extraction rules for
semistructured, Web-based information sources. Proc.
AAAI98: Workshop on AI and Information Integration,
Madison, Wisconsin, July 1998.
[13] S. Nestorov, S. Abiteboul, R. Motwani: Extracting
Schema from Semistructured Data. Proc. ACM
SIGMOD98, Seattle, 1998
[14] S. Nestorov, S. Abiteboul, R. Motwani, Inferring
Structure in Semistructured Data, ACM SIGMOD
Record 26(4), 1997
[15] A. Rajaraman, J. D. Ullman: Querying Websites
Using Compact Skeletons. Proc. ACM PODS 2001
[16] E. Riloff, Automatically Constructing a Dictionary
for Information Extraction Tasks, Proc. AAAI-1993,
811-816
[17] S. Soderland, D. Fisher, J. Aseltine, W. Lehnert,
CRYSTAL: Inducing a Conceptual Dictionary, Proc.
14th Int. Joint Conf. on AI, 1314-1321, 1995
[18] S. Soderland, D. Fisher, J. Aseltine, W. Lehnert,
Learning information extraction rules for semi-
structured and free text. Machine Learning, 44(1-
3):233-272, 1999
[19] T. W. Yan, H. Garcia-Molina: Duplicate Removal
in Information System Dissemination. Proc. VLDB
1995: 66-77
[20] M. Lopez, Access and Integration of Distributed,
Heterogeneous Information, 1999
http://dyade.inrialpes.fr/mediation/index.html
[21] D. Smith, M. Lopez, Information extraction for
semi-structured documents, Proc. Workshop on
Management of Semi-structured Data, Tuscon, 1997
http://www.research.att.com/~suciu/workshop-
papers.html
[22] C. Soorangura, Applying a case-based approach to
induce text extraction rules, MSc Thesis, University of
East Anglia School of Information Systems, 2000
[23] The Festival Speech Synthesis System,
http://www.speech.cs.cmu.edu/festival/manual-
1.4.1/festival_toc.html