Conference Paper · December 2009



Pattern Based Indonesian Question Answering System
Hapnes Toba and Mirna Adriani
Information Retrieval Laboratory
Faculty of Computer Science, University of Indonesia
Email: [email protected], [email protected]

Abstract—This paper describes a pattern-based approach to an Indonesian question answering system using Open Ephyra. In this study, we classify factoid question types into 8 categories, where each group is trained using specific questions. The results demonstrate the potential of the approach for an automated Indonesian question answering system.

I. INTRODUCTION

Question answering is a form of information retrieval that is concerned with finding an exact answer to a given natural language question rather than with matching a query string. An automated question answering system (QAS) tries to retrieve explicit answers in the form of a single answer or snippets of text rather than a whole document or set of documents. One of the biggest challenges in QAS is how to categorize a question into a particular category that will further be used to find exact answer(s) within a large collection of documents. Research in QAS began in the 70's, when [25] developed SHRDLU, a system that can understand natural language. Such systems were developed further by [15] in QUALM, which can understand stories. QAS research was revived in the 90's, when the Internet became a redundant source of information [3].

The results of the two main evaluation forums in QAS, i.e. TREC (Text Retrieval Conference, http://trec.nist.gov) from 1999-2007 and CLEF (Cross Language Evaluation Forum, http://www.clef-campaign.org/) from 2002-2008, have shown that question answering still needs further enhancement. The question-answering techniques from the fields of natural language processing (NLP), information retrieval (IR), and their combination are promising, but there is still a lack of standardization in methods, techniques, and evaluation [8]. According to [3], there is no ultimate QAS; each approach has its own niche, application environment, and tasks. If the quality of the answers is crucial, NLP should be applied. If facts need to be extracted from text, IR techniques should be used. The main techniques used in QAS research are: semantic analysis using semantic role labeling, named entity recognition, and path dependency [21]; semantic markup [16]; n-gram passages [4, 5]; statistical approaches [11][17]; and the combination of semantic structures and a probabilistic approach [18].

A common feature of NLP-based QAS is the ability to convert text input into a formal representation of meaning, such as logic (first-order predicate calculus), semantic networks, conceptual dependency diagrams, or frame-based representations [13]. IR-based QAS, usually complemented with shallow or deep NLP techniques, focus on fact retrieval from a large text corpus. Document redundancy, i.e. a number of similar statements containing the answer in a large corpus, together with the use of shallow NLP techniques, increases the chance of finding the right answer, although without any guarantee that the answer is correct. Shallow NLP techniques, combined with statistical methods, pattern learning, and passage retrieval, have been widely used for the extraction of definition answers in TREC and CLEF, as described in [20], [23], [6], and [22]. A future research trend should involve combining these approaches into one single system and adapting the techniques to the application domain.

In the Information Retrieval Lab at the University of Indonesia, some research has been done to initiate Indonesian QAS and to contribute to CLEF [2, 24]. These research approaches can be grouped into three mainstreams, i.e. semantic analysis [14, 12], statistical approaches [1], and the combination of both [9].

In this experiment, we try to adapt the categorization approach within the Open Ephyra (OE) pipeline, which uses a pattern learning approach, to the Indonesian language. Pattern learning is a form of rule-based approach to question categorization; other commonly used approaches are language modeling and machine learning [19]. The final goal of this experiment is to investigate how Indonesian interrogative sentences fit into an English-based QAS, as a preliminary study toward a QAS proper to the Indonesian language that combines the NLP and statistical approaches.
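The rule-based flavor of question categorization mentioned above can be illustrated with a minimal sketch. The rules below are hypothetical simplifications keyed to this paper's coarse categories, not the actual Open Ephyra patterns:

```python
import re

# Hypothetical rules: each coarse category is recognized by the
# Indonesian question word(s) that typically signal it.
RULES = [
    ("TEMPAT", re.compile(r"\b(di\s?mana(kah)?)\b", re.IGNORECASE)),  # where
    ("WAKTU",  re.compile(r"\b(kapan(kah)?)\b", re.IGNORECASE)),      # when
    ("ORANG",  re.compile(r"\b(siapa(kah)?)\b", re.IGNORECASE)),      # who
    ("ANGKA",  re.compile(r"\b(berapa(kah)?)\b", re.IGNORECASE)),     # how many
]

def categorize(question: str) -> str:
    """Return the first matching coarse category, or LAIN-LAIN (others)."""
    for category, pattern in RULES:
        if pattern.search(question):
            return category
    return "LAIN-LAIN"

print(categorize("Dimana letak kampus UI?"))    # TEMPAT
print(categorize("Siapa presiden Indonesia?"))  # ORANG
```

A real categorizer would also have to extract the target and context objects and resolve questions that match several rules, which is exactly what the pattern interpretation described in the following sections addresses.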
The rest of the paper is organized as follows: Section 2 presents the state of the art in Indonesian question categorization and the pattern learning approach used in Open Ephyra. Section 3 presents the strategy used to develop question patterns from Indonesian interrogative sentences. Section 4 gives the experimental setup and results. Section 5 gives the evaluation of the results, and Section 6 concludes with plans for future work.

II. QUESTION CATEGORIZATION

The study of Purwarianti et al., 2006 [19] has shown that question categorization for Indonesian interrogative sentences using a shallow parser and machine learning can achieve 95% accuracy. However, such learned categorization requires deep analysis of sentence structures in order to develop a robust parser and to extract learning features. The problem with such deep analysis for Indonesian is the limitation of the resources of the language itself, which lacks a robust parser.

A simpler alternative approach to question categorization is to match the pattern of each question type, categorizing a question based on the position of the question words and on various question keywords. To make sure that the categorization can be done, we thus need to develop a number of pattern rules that reflect the structure of each question type.

OE (Open Ephyra, http://www.ephyra.info) is an example of a QAS that uses a pattern learning approach to categorize questions. It learns text patterns that can be applied to text passages for answer extraction. OE can learn question-answer pairs and use common retrieval systems, i.e. web search engines or various IR systems, to fetch document text. OE is an open-domain QAS with a modular and extensible framework. It consists of four main modules (see Figure 1): 1) question analyzer, 2) query generator, 3) search engine, and 4) answer extractor. Each of the modules can be used independently, which makes the framework suitable for experimenting with multiple approaches to question answering in one system.

Fig. 1. Open Ephyra Framework (Schlaefer et al., 2006)

There are two main steps in the pattern-learning approach in OE. The first is to learn the question patterns from question templates for each question type. The aim of this step is to interpret the questions and transform them into queries. The second step is to learn the answer patterns from question-answer pairs. The aim of this second step is to extract answer candidates from relevant document snippets and to rank them. The question templates need to be developed manually according to the various interrogative sentences, which are specific to each natural language.

An important strategy in this approach is to determine the object interpretation of each question, i.e. the property (PO), target (TO), and context (CO). For example, the question "Dimana letak kampus UI?" (Where lies the campus of UI?) has the interpretation:

- Property: TEMPAT (location)
- Target: kampus (campus)
- Context: UI (University of Indonesia)

The question asks for a property of the target object "kampus", which is a location. The context object "UI" narrows down the search to a particular place. Together, all of these objects form the interpretation of the question, which will also be used later in developing a query for the search engine.

OE interprets a question by sequentially matching all the question patterns for each property. If the question matches a pattern, OE extracts the target object and the context object. It is therefore possible for a question to have more than one interpretation if the question type is not clear enough.

Each property is associated with a set of answer patterns that are assessed and extracted during the second learning phase. The format of an answer pattern is similar to that of the question patterns; it consists of:

- a target tag <T>,
- an arbitrary number of context tags <C>, and
- a property tag <P>.

During the answer extraction phase, OE replaces all occurrences of target or context objects in a text snippet. Every time a snippet matches a pattern, the part of the text associated with the property tag is extracted. For the example above, the following snippet from a relevant document:
"Bus kampus disediakan untuk melayani kebutuhan transportasi mahasiswa di dalam kampus UI Depok"

(The campus bus is provided to serve the transportation needs of students within the UI Depok campus)

will be transformed into

"Bus kampus disediakan untuk melayani kebutuhan transportasi mahasiswa di dalam <TO> <CO> Depok".

The pattern "di dalam <TO> <CO> Depok" will then be used to extract the property Depok. The pattern has been learned from the question-answer pairs, in which the answer is what needs to be found.

To extract the answer, OE applies two kinds of regular expressions (Zhang and Lee, 2002):

1) one that covers a target tag <T>, a property tag <P>, and any characters in between them, i.e. the context; or
2) one that covers one word or any characters preceding or following the <P> tag.

Finally, OE assesses the answer patterns and assigns a confidence score to each pattern. The confidence score is calculated as the proportion of how often a pattern could be used to extract a correct answer over the sum of how often it extracts a wrong answer and a right answer. Further, for each extracted property, the total number of assessed snippets is also recorded in order to compute the support, i.e. the ratio between correct answers and the number of snippets. The answer with the highest confidence whose value is above the support threshold is considered the best answer candidate.

III. QUESTION PATTERN DEVELOPMENT

The properties and the question patterns are determined by analyzing 8 types of factoid questions, each consisting of 5 training questions and 25 testing questions. The question types are coarse-grained according to CLEF (Forner et al., 2008), i.e.:

- ORANG (people)
- WAKTU (time)
- TEMPAT (location)
- ORGANISASI (organization)
- UKURAN (measure)
- ANGKA (count)
- OBYEK (object)
- LAIN-LAIN (others).

For each question type, a question pattern is developed which reflects the variations that arise in natural language writing or conversation in the training questions. The testing questions are used to evaluate the accuracy of the pattern recognizer. During the design of each question pattern, the following aspects are taken into account:

1. The main question word(s) that indicate each type.
2. The position in which question keywords occur in an interrogative sentence, which indicates the context and target of a question.
3. Alternative question words or phrases that indicate a special meaning for a question type.

As an example, these are some question patterns for the property location (dimana = where) in Indonesian:

1. Main question word(s):

(dimana|dimanakah) letak <TO> <CO>
(dimana|dimanakah) <CO> <TO>
(dimana|dimanakah) <TO> berada

2. Position of keywords (context and target):

<TO> ada dimana (saja)?
<TO> <PO> ada dimana
dimana (saja)? (kah|sih)? <TO> berada
<TO> terletak <CO> (apa)?
<TO> <CO> (ada)? dimana (saja)?
<CO> <TO> (ada)? dimana (saja)?
<TO> <CO> ada dimana?

3. Alternative question words or phrases:

(apa|apakah) nama <TO> <CO>
(apa|apakah) nama (daerah|tempat|lokasi) <TO> <CO>
(apa|apakah) nama (daerah|tempat|lokasi) <CO> <TO>
di (lokasi|daerah|tempat) mana (saja)? <TO> <CO>
di (lokasi|daerah|tempat) mana (saja)? <CO> <TO>
terletak dimana (saja)? <TO> <CO>

During the answer pattern extraction, i.e. after the interpretation has been done, a tuple consisting of a target, an arbitrary number of context objects, and the answer (the property) is generated as a query string. The query string is used by the IR system to retrieve the passages that match the pattern. In the future, it is possible to selectively add tuples for properties that are not sufficiently covered by the training questions, in order to raise the accuracy of the system.

For our example in the previous section, the query string that will be generated is: "#1(kampus) #1(UI) #1(depok)". The answer is included in the query string to ensure that the snippet contains both the target and the property.

IV. EXPERIMENTAL SETUP AND RESULTS

The document collection used in the experiment consists of web documents (news articles, wikis, and blogs) on 5 special issues from different domains. The issues range from historical occasions, biography, and political news to myths and poetry. Every downloaded document has a positive relevance judgment for a specific question type, i.e. each document has snippets that contain answer(s) to the questions given for each type. There are 190 documents in total, amounting to 6 MB of text. The documents are processed with Perl into TREC-format documents, which are then indexed using Indri. The index file is 2 MB and is used as the local corpus for the system.

In the interpretation columns of the result tables below, we separate "single" interpretations, i.e. exactly one interpretation for a question, from "more" interpretations, meaning that a question can be interpreted into more than one category, according to the patterns.
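The tuple-to-query step described at the end of Section III can be sketched as follows. The helper below is a hypothetical illustration that only reproduces the Indri-style exact-match operators shown in the running example, not OE's actual query generator:

```python
def build_query(target: str, contexts: list[str], answer: str) -> str:
    """Wrap the target, context objects, and candidate answer in Indri's
    #1(...) exact-phrase operator, in that order."""
    terms = [target, *contexts, answer]
    return " ".join(f"#1({term})" for term in terms)

# Reproduces the query string for the running example.
print(build_query("kampus", ["UI"], "depok"))  # #1(kampus) #1(UI) #1(depok)
```

Including the candidate answer ("depok") as a query term is the design choice discussed above: it restricts retrieval to snippets that contain the target and the property together.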
For the training phase, we prepared 40 questions, i.e. 5 questions per question type and 1 question per subject. Each training question has its answer pair in TREC format, which is used in the pattern extraction phase. An example of an answer extraction result for the location type is as follows:

<TO> [^<]*?<PO_NEtempat>
#correct: 5
#incorrect: 6
Confidence = 0.45454547
Support = 0.035311

This means that the pattern was found in 11 passages (not necessarily in 11 documents), with 5 passages containing a correct answer and 6 an incorrect answer, and that the total number of passages extracted over all 8 properties is 142 (= 5/0.035311).

After the training is done, we run OE to test how accurate the results of the training session are. First, we run the same questions as in the training phase and obtain the following results:

                 Interpretation              Answers
                 Single        More
Property         Corr.  Wrong  Corr.  Wrong  W    X    U    R
People           0      0      5      0      4    0    0    1
Time             5      0      0      0      3    0    0    2
Location         5      0      0      0      2    0    0    3
Organization     2      1      1      1      2    0    0    3
Measure          4      0      1      0      3    0    0    2
Number           3      0      2      0      4    0    0    1
Object           2      1      2      0      2    1    0    2
Other            4      0      0      1      5    0    0    0
Total            25     2      11     2      25   1    0    14

(Answer columns: W = incorrect, X = inexact, U = unsupported, R = correct.)

We also test the learning results using the testing questions, which consist of 5 variations for each issue, except for the "other" type, giving 180 questions in total. The results for the testing questions are:

                 Interpretation              Answers
                 Single        More
Property         Corr.  Wrong  Corr.  Wrong  W    X    U    R
People           6      6      11     2      23   0    0    2
Time             14     4      7      0      21   0    0    4
Location         17     7      0      1      14   0    1    10
Organization     5      13     2      5      21   0    0    4
Measure          7      7      10     1      20   0    2    3
Number           10     3      12     0      21   0    2    2
Object           5      18     1      1      21   1    1    2
Other            1      1      3      0      5    0    0    0
Total            65     59     46     10     146  1    6    27

The accuracy is calculated for the interpretation results and for the answers. We use the accuracy definition from CLEF 2008, i.e. the average of SCORE(q) over all questions q, where SCORE(q) is 1 if the answer to q is assessed as correct, and 0 otherwise. The accuracy for the tested-training questions and the testing questions is summarized below:

                            Interpretation Accuracy    Answers Accuracy
Tested-training questions   90.00                      35.00
Testing questions           61.67                      15.00

V. EVALUATION

The accuracy for the tested-training questions is much higher than for the testing questions. This result suggests that the patterns used during the training session are not good enough to cover the question variations. Furthermore, we have to pay extra attention to the informality of the natural language used in the documents, especially in blogs and wikis. This experiment is more concerned with the interpretation of questions; if we look only at the interpretation, the results are promising, although it seems that we need to cover more question variations.

The most difficult question patterns to interpret are those for the people and organization types. In Indonesian, both can be asked using the question word "siapa" (= who), which gives a double interpretation to a specific question. This affects the confidence score, i.e. it gives a lower score, because fewer correct answers are extracted.

The object type is also hard to interpret, because similar question patterns for this type can occur in other types.

VI. CONCLUSION AND FUTURE WORKS

In this experiment, we have adapted the question pattern approach in OE. The results show that OE is promising for adaptation to the Indonesian language. The main shortcomings of the pattern learning approach are that question patterns need to be developed specifically, and that the answer extraction phase needs redundant sources, i.e. a large search space.

There are a number of things that need to be investigated further, such as:

1. How to generate more generic patterns that can be used to interpret questions accurately.
2. How to decrease the runtime during the learning steps, i.e. how to prune unnecessary patterns so that they are not learned more than once.
3. How to deal with ambiguity in question words and keyword phrases that can occur in more than one question type.
4. How to give a sense of contextuality during the answer selection phase. For example, how can we deal with a time frame, so that if the question is about the president at the present time, the system returns the current president and not a president from the past, although both answers can be found as patterns in the relevant documents.

Based on these shortcomings, the adaptation of OE to the Indonesian language needs the following course:

1. Develop more fine-grained question types that each represent a special named-entity (NE) type. For example, the type "location" can be defined more precisely as university, country, etc.
2. Modify or change the natural-language-specific components, such as the NE tagger, stemmer, phrase chunker, part-of-speech (POS) tagger, and tokenizer.
3. Develop a statistical/machine-learning question classifier that can categorize questions based on their features, such as unigrams, bigrams, or the question word.

REFERENCES

[1] Adiwibowo, Septian. 2008. Answer Finding in Indonesian-English Question Answering System Using Word Weighting and Information from the Internet. Bachelor Thesis, Faculty of Computer Science, University of Indonesia.
[2] Adriani, Mirna, et al. 2005. University of Indonesia Participation at CLIR - CLEF 2005.
[3] Andrenucci, Andrea and Sneiders, Eriks. 2005. Automated Question Answering: Review of the Main Approaches. Proceedings of the Third International Conference on Information Technology and Applications (ICITA'05), IEEE Computer Society.
[4] Buscaldi, Davide, et al. 2006. N-gram vs. Keyword-based Passage Retrieval for Question Answering. Proceedings of the Cross Language Evaluation Forum (QA@CLEF 2006).
[5] Buscaldi, Davide, et al. 2009. Answering Questions with an n-Gram Based Passage Retrieval Engine. Journal of Intelligent Information Systems, Springer Netherlands. Published online 13 March 2009. DOI 10.1007/s10844-009-0082-y.
[6] Clarke, C., et al. 2001. Web Reinforced Question Answering. In D. Harman and E. Voorhees (eds.): Proc. of TREC 2001, NIST, Gaithersburg, USA.
[7] CLEF. 2008. Guidelines for the Participants in QA@CLEF 2008.
[8] Ferrucci, et al. 2009. IBM Research Report: Towards the Open Advancement of Question Answering Systems. RC24789 (W0904-093), April 22.
[9] Fitria, Lily. 2007. Answer Finding Using Sentence Dependency Structure in Question Answering System. Master Thesis, Faculty of Computer Science, University of Indonesia.
[10] Forner, Pamela, et al. 2008. Overview of the CLEF 2008 Multilingual Question Answering Track.
[11] Ittycheriah, Abraham, et al. 2001. IBM's Statistical Question Answering System. Proceedings of the 10th Text REtrieval Conference (TREC 2001).
[12] Jati, Rangga M. 2009. The Development of Natural Language Generation with Chart Generation Approach and Its Application in Indonesian Question Answering. Bachelor Thesis, Faculty of Computer Science, University of Indonesia.
[13] Jurafsky, D. and Martin, J.H. 2000. Speech and Language Processing. Prentice Hall, NJ, USA.
[14] Larasati, Dian S. and Manurung, Ruli. 2007. Towards a Semantic Analysis of Bahasa Indonesia for Question Answering. Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING 2007).
[15] Lehnert, Wendy G. 1977. The Process of Question Answering. Dissertation Thesis, Yale University.
[16] Lopez, Vanessa, et al. 2005. AquaLog: An Ontology-Portable Question Answering System for the Semantic Web. European Semantic Web Conference (ESWC 2005), LNCS 3532, pp. 546-562, Springer-Verlag Berlin Heidelberg.
[17] Metzler, Donald and Croft, W. Bruce. 2004. Analysis of Statistical Question Classification for Fact-based Questions. Kluwer Academic Publishers.
[18] Narayanan, Srini and Harabagiu, Sanda. 2004. Question Answering Based on Semantic Structures.
[19] Purwarianti, et al. 2006. Estimation of Question Types for Indonesian Question Sentence. Department of Information and Computer Sciences, Toyohashi University of Technology.
[20] Ravichandran, D. and Hovy, E.H. 2002. Learning Surface Text Patterns for a Question Answering System. Proc. of ACL-2002, ACL Press, USA.
[21] Schlaefer, Nico. 2007. Deploying Semantic Resources for Open Domain Question Answering. Diploma Thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University.
[22] Schlaefer, Nico, et al. 2006. A Pattern Learning Approach to Question Answering within the Ephyra Framework. LNAI 4188, pp. 687-694, Springer-Verlag Berlin Heidelberg.
[23] Soubbotin, et al. 2002. Use of Patterns for Detection of Answer Strings: A Systematic Approach. In L.P. Buckland and E. Voorhees (eds.): Proc. of TREC 2002, NIST, Gaithersburg, USA.
[24] Wijono, Sri Hartati, et al. 2006. Finding Answers to Indonesian Questions from English Documents. Proceedings of the Cross Language Evaluation Forum (QA@CLEF 2006).
[25] Winograd, Terry. 1971. Procedures as a Representation for Data in a Computer Program for Understanding Natural Language. Dissertation Thesis, Massachusetts Institute of Technology.
[26] Zhang, D. and Lee, W. 2002. Web Based Pattern Mining and Matching Approach to Question Answering. Proceedings of the 11th Text REtrieval Conference (TREC 2002).

