Porter Stemming Algorithm For Semantic Checking
All content following this page was uploaded by Noraida Haji Ali on 20 June 2015.
Abstract-Students tend to design UML class diagrams without regard to the quality and accuracy of the model, such as design consistency in the modeling process. Thus, one needs to know the quality of a UML model under development in order to improve the modeling phase. There are several levels of quality in practical UML-based projects. One of them is model quality, which emphasizes the correctness and completeness of the software models and their meanings. In this paper, we use a synsets extraction process. The output of this process is a set of words that contain suffixes. Semantic analysis will be less precise if the words contain suffixes. Hence, a stemming process is important to increase the precision of the extracted words. This paper proposes automated semantic checking of object-oriented models that applies the Porter Stemming algorithm in order to achieve quality in modeling. The results show the differences between synsets before and after the stemming process.

Keywords-Porter Stemming algorithm, WordNet, Semantic Checking

I. INTRODUCTION

[...] the same term with a root form [1]. The algorithms for stemming have been studied in Computer Science since 1968. In search engines, a process called conflation treats [...]

[...] words with a suffix should be stemmed to get the best results. In this study, the search process is repeated until the search depth is determined. Without this stemming process, the end result may not satisfy the search or meet the requirements of the developed system.

The Porter Stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980 [3]. The stemmer is based on the idea that the suffixes in the English language (approximately 1200) are mostly made up of combinations of smaller and simpler suffixes. It is a linear step stemmer: it has five steps, and rules are applied within each step. Figure 1 shows the steps of the Porter Stemming Algorithm.

[Figure 1: flowchart. Recoverable labels: START; Step 1: 1) Remove and recode plurals, 2) Remove 'ed' or 'ing' if found, "Was 2) fired?" (Yes/No), 3) Recode remaining stem; "Does stem contain double suffix?" (Yes/No), Map to single suffix; Index penultimate letter of stem; Step 3.]
Figure 1. Porter Stemming Algorithm [4]
[...] extracting synsets from RiTa.WordNet. This algorithm was chosen for two reasons. First, it provides a simple approach to conflation that seems to work well in practice and that is applicable to a variety of languages. Second, it has prompted interest in stemming as a topic for research in its own right, rather than solely as a low-level component of an information retrieval system [14]. The Porter Stemming algorithm became the most popular and standard approach to stemming because it is very concise and very readable for a programmer. Figure 2 shows the part of the framework in our study that applies the Porter Stemming Algorithm in Phase 2 of the Framework to Analyze Semantic of Object-Oriented Model (FASOOM).

An operation can have any number of parameters or none at all.
5) Type: A data type is a classifier whose instances are identified only by their value, for example date, time, string, integer and many more.

B. Tokenization
A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. The process of breaking a text up into its constituent tokens is known as tokenization. Standard methods in tokenization are [16]:
F. Knowledge Base (KB)
The knowledge base stores the end results of the synsets extraction process. After the tokenization, extraction and stemming steps, the end results are the synsets that were extracted and have been stemmed.

IV. EXPERIMENTAL SETTING
In this experiment, we used Java as the programming language to run the Porter Stemmer Algorithm. The experimental design is based on Figure 3 below.

[Figure 3: Phase I: Input; Phase II: Stemming Process; Phase III: Database; Phase IV: Result Analysis]
Figure 3. Experimental design

Based on Figure 3, we worked on four phases: input, stemming process, database and result analysis. The details of the experimental design are discussed as follows.

A. Phase I: Input
In the input part, the searched word is entered as input. The searched word is a word that contains a suffix, which is then removed in the next phase.

B. Phase II: Stemming Process
In the processing part, processing is done based on the input entered. The Porter Stemmer Algorithm is applied in this part. This is where all synsets of the searched word are stemmed. The stemming process applies five steps as described below.

1) Step 1
This step is designed to deal with past participles and plurals. It is the most complex step and is separated into three parts in the original definition: 1a, 1b and 1c. Step 1 also removes inflectional suffixes (i-suffixes).
a) Step 1a
SSES -> SS
caresses -> caress
b) Step 1b
(*v*) ING ->
opening -> open
c) Step 1c
(*v*) Y -> I
history -> histori

2) Step 2
This step is much more straightforward. It deals with pattern matching on some common suffixes. It removes derivational suffixes (d-suffixes) and follows rules such as:
(m>0) ATIONAL -> ATE
relational -> relate

3) Step 3
Step 3 deals with special word endings. It also removes derivational suffixes (d-suffixes). Composite d-suffixes are reduced to single d-suffixes one at a time. Therefore, if a word ends with -icational, Step 2 reduces it to -icate and Step 3 reduces it to -ic. Below is an example of the rules applied in Step 3.
(m>0) NESS ->
possibleness -> possible

4) Step 4
Step 4 checks the stripped word against more suffixes in case the word is compounded. It deals with -ic, -able, -ive and many more, using a strategy similar to Step 3. An example of the rules involved in this step is shown below:
(m>1) MENT ->
adjustment -> adjust

5) Step 5
Step 5 tidies up after the suffixes have been removed in the previous steps. It checks whether the stripped word ends in a vowel and fixes it appropriately. It consists of Step 5a and Step 5b, as indicated in the examples:
a) Step 5a
(m>1) E ->
probate -> probat
b) Step 5b
(m>1 and *d and *L) -> single letter
bill -> bil

C. Phase III: Database
The words before and after the stemming process are stored in a database named "synonym". These words are stored in a table named "stem", which contains the unique field "queryID", "query", which stores the words before stemming, "stemQuery", which stores the words after stemming, and "synsetid", which stores the foreign key for the synsets id from the table "synsets".
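The rule notation used in the stemming steps of Phase II can be made concrete with a small sketch. In Porter's definition [3], m is the "measure" of the stem left after removing the suffix (roughly, the number of vowel-then-consonant sequences in it), and *v* means that the stem contains a vowel. The following Java sketch is only an illustration of how a single rule of this form might be checked and applied; the class and method names (PorterRuleSketch, applyMeasureRule, applyVowelRule) are hypothetical and are not taken from the implementation used in this study. It assumes lower-case input and is not a complete stemmer: the full algorithm also selects the longest matching suffix within each step and chains the five steps.

```java
final class PorterRuleSketch {

    // A letter is a consonant unless it is a, e, i, o, u, or a 'y' that follows a consonant.
    static boolean isConsonant(String w, int i) {
        char c = w.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return false;
        if (c == 'y') return i == 0 || !isConsonant(w, i - 1);
        return true;
    }

    // Measure m: counts each place where a vowel is directly followed by a consonant.
    static int measure(String stem) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean vowel = !isConsonant(stem, i);
            if (prevVowel && !vowel) m++;
            prevVowel = vowel;
        }
        return m;
    }

    // Condition *v*: the stem contains at least one vowel.
    static boolean containsVowel(String stem) {
        for (int i = 0; i < stem.length(); i++) {
            if (!isConsonant(stem, i)) return true;
        }
        return false;
    }

    // Rule "(m > minMeasure) SUFFIX -> replacement"; pass -1 for an unconditional rule.
    static String applyMeasureRule(String word, String suffix, String replacement, int minMeasure) {
        if (!word.endsWith(suffix)) return word;
        String stem = word.substring(0, word.length() - suffix.length());
        return measure(stem) > minMeasure ? stem + replacement : word;
    }

    // Rule "(*v*) SUFFIX -> replacement".
    static String applyVowelRule(String word, String suffix, String replacement) {
        if (!word.endsWith(suffix)) return word;
        String stem = word.substring(0, word.length() - suffix.length());
        return containsVowel(stem) ? stem + replacement : word;
    }

    public static void main(String[] args) {
        System.out.println(applyMeasureRule("caresses", "sses", "ss", -1));      // caress
        System.out.println(applyVowelRule("opening", "ing", ""));                // open
        System.out.println(applyVowelRule("history", "y", "i"));                 // histori
        System.out.println(applyMeasureRule("relational", "ational", "ate", 0)); // relate
        System.out.println(applyMeasureRule("possibleness", "ness", "", 0));     // possible
        System.out.println(applyMeasureRule("adjustment", "ment", "", 1));       // adjust
        System.out.println(applyMeasureRule("probate", "e", "", 1));             // probat
    }
}
```

The measure condition is what keeps short words intact: for "rational" the stem before ATIONAL is "r", whose measure is 0, so the rule (m>0) ATIONAL -> ATE does not fire and the word is left unchanged.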
D. Phase IV: Result Analysis
The result is the set of words obtained after the stemming process. Some of the stemmed words contain errors. This phase analyzes the result to determine which words are stemmed to their root and which are not.

V. RESULTS AND DISCUSSION

A. Stemming Results
Some synsets have been extracted from WordNet to test the stemming algorithm. Examples of synsets before and after applying the Porter Stemming Algorithm are shown in Table I.

TABLE I. EXAMPLE OF SYNSETS THAT HAVE BEEN STEMMED
Before stemming    After stemming
extraction         extract
accounting         account
partitioning       partition
possibleness       possible
categorisation     categorization
adjustment         adjust
opening            open
segmentation       segment
reflection         reflect
sorting            sort

[...] that led to loss of precision [19]. Examples of synsets with over-stemming errors are shown in Table II.

TABLE II. EXAMPLE OF SYNSETS WITH OVER-STEMMING ERRORS
Before stemming    After stemming
bill               bil
determination      determin
explanation        explan
conclusion         conclus
decisiveness       decis
consistence        consist
history            histori
division           divis
declination        declin
section            sect

This problem can be solved by restricting the suffixes chosen for removal to those other than ATION, ATOR, EMENT, MENT, ENT, ION and E. We remove only certain suffixes because of the errors detected when synsets are extracted.

C. Performance
The accuracy of this algorithm is tested by calculating the mean number of words (MWC), obtained by dividing the number of words entered by the number of accurate words after stemming.
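As a minimal sketch of the performance calculation just described, the ratio might be computed as follows. The class and method names (MwcSketch, mwc) are hypothetical, the reading of MWC as "words entered divided by accurately stemmed words" follows the sentence above, and the counts in main are made-up illustration values, not results from this study.

```java
final class MwcSketch {

    // MWC as described in the text: number of words entered divided by
    // the number of words that were stemmed accurately.
    static double mwc(int wordsEntered, int accurateWords) {
        if (wordsEntered < 0 || accurateWords <= 0) {
            throw new IllegalArgumentException("wordsEntered must be >= 0 and accurateWords > 0");
        }
        return (double) wordsEntered / accurateWords;
    }

    public static void main(String[] args) {
        // Hypothetical counts for illustration only.
        System.out.println(mwc(20, 10)); // prints 2.0
    }
}
```

Under this reading, a value of 1.0 would mean that every word entered was stemmed accurately, and larger values indicate more stemming errors.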
Future studies will attempt to resolve the problems related to stemming errors shown in the results. Although the extraction of some words can be resolved with this algorithm, some words take on a different meaning after stemming, such as the word opening, which is stemmed to open. The synsets list generated when opening is the search word differs from the synsets list for the word open.

ACKNOWLEDGMENT
This research was supported by a grant from Tabung Bantuan Pendidikan Khas (TBPK), Universiti Malaysia Terengganu (Vot: 53057) and the National Science Fellowship (NSF) under the Ministry of Science, Technology and Innovation (MOSTI) Malaysia. We would also like to thank the reviewers of ICCIT 2012 for their comments, suggestions, and bibliographical indications.

REFERENCES
[1] Grigore, M., Introduction to Stemming. 2008.
[2] Frakes, W.B., et al., "Strength and similarity of affix removal stemming algorithms," SIGIR Forum, vol. 37, pp. 26-30. 2003.
[3] Porter, M.F., "An algorithm for suffix stripping," Program, vol. 14, pp. 130-137. 1980.
[4] Hooper, R., et al. The Lancaster Stemming Algorithm 2005; Available from: http://www.comp.lancs.ac.uk/computing/research/stemming/Files/porter.JPG.
[5] Porter, M. The Porter Stemming Algorithm. 2006 [cited 2011 14 November 2011]; Available from: http://tartarus.org/~martin/PorterStemmer/index.html.
[6] Basavaraju, M., et al., "A Novel Method of Spam Mail Detection using Text Based Clustering Approach," International Journal of Computer Applications, vol. 5, pp. 15-25. 2010.
[7] Booch, G., "UML in action," Communication of ACM, vol. 42, pp. 26-28. 1999.
[8] Christian, F.J.L. "Improving the quality of UML models in practice," in Proceedings of the 28th international conference on Software engineering, Shanghai, China: ACM, 2006.
[9] Unhelkar, B., Process Quality Assurance for UML-Based Projects. Boston: Addison-Wesley Professional, 2003.
[10] Warmer, J., et al., The Object Constraint Language: Precise Modeling With Uml (Addison-Wesley Object Technology Series): Addison-Wesley Professional, 1998.
[11] Unhelkar, B., Verification and Validation for Quality of UML 2.0 Models. 2005, John Wiley & Sons, Inc.: Hoboken, New Jersey.
[12] Thomasson, B., et al. "Identifying novice difficulties in object oriented design," in Proceedings of the 11th annual SIGCSE conference on Innovation and technology in computer science education, Bologna, Italy: ACM, 2006.
[13] Noor Maizura Mohamad Noor, et al. "A New Framework to Extract WordNet Lexicographer Files for Semi-Formal Notation: A Preliminary Study," in 4th International Symposium on Information Technology 2010 (ITSim'10), vol. 2, Kuala Lumpur, Malaysia: Institute of Electrical and Electronics Engineers Inc., 2010, pp. 1027-1031.
[14] Willett, P., "The Porter stemming algorithm: then and now," Program: electronic library and information systems, vol. 40, pp. 219-223. 2006.
[15] Ali, N.H., et al., "Analyze Semantic of Object-Oriented Model Using RiTa.WordNet," Journal of Computing, vol. 3, pp. 61-66. 2011.
[16] Rennie, J., Text Classification. 2003, p. 1-47.
[17] Miller, G.A., "WordNet: A Lexical Database for English," Communication of The ACM, vol. 38, pp. 39-41. 1995.
[18] Paice, C.D. "An evaluation method for stemming algorithms," in Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, Dublin, Ireland: Springer-Verlag New York, Inc., 1994, pp. 42-50.
[19] Airio, E., "Word normalization and decompounding in mono- and bilingual IR," Information Retrieval, vol. 9, pp. 249-271. 2006.