Porter Stemming Algorithm For Semantic Checking

The document discusses using the Porter stemming algorithm to improve semantic checking of UML class diagrams by stemming words extracted from synsets. The Porter stemming algorithm reduces words to their stem or root form in 5 steps to increase precision for semantic analysis. The results will show differences in synsets before and after the stemming process.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/260385215


All content following this page was uploaded by Noraida Haji Ali on 20 June 2015.



Porter Stemming Algorithm for Semantic Checking

Noraida Haji Ali, Noor Syakirah Ibrahim

Computer Science Department
Faculty of Science and Technology
Universiti Malaysia Terengganu
Terengganu, Malaysia

Abstract—Students tend to design UML class diagrams regardless of the quality and accuracy of the model, such as design consistency in the modeling process. Thus, one needs to know the quality of a UML model under development in order to improve the modeling phase. There are several levels of quality in practical UML-based projects. One of them is model quality, which emphasizes the correctness and completeness of the software models and their meanings. In this paper, we use a synsets extraction process. The output of this process is a set of words that contain suffixes. Semantic analysis will be less precise if the words contain suffixes. Hence, a stemming process is important to increase the precision of the extracted words. This paper proposes automated semantic checking of object-oriented models by applying the Porter Stemming algorithm in order to achieve quality in modeling. The results show the differences between synsets before and after the stemming process.

Keywords-Porter Stemming algorithm, WordNet, Semantic Checking

I. INTRODUCTION

Stemming is a process for reducing inflected or derived words to their stem, base or root form. It is a form of morphological analysis that tries to associate variants of the same term with a root form [1]. Algorithms for stemming have been studied in computer science since 1968. In search engines, a process called conflation treats words with the same stem as synonyms, as a kind of query broadening. Stemming algorithms are used in many types of language processing, text analysis systems, information retrieval and database search systems [2]. The stemming process is essential to the operation of classifiers and index builders or searchers. It makes the operation less dependent on particular word forms and therefore reduces the potential size of vocabularies, which might otherwise have to contain all possible forms. It might be useful to think of stemming as the automatic definition of a group of synonyms for a particular word. In WordNet, a few words with suffixes should be stemmed to get the best results. In this study, the search process is repeated until the search depth is determined. Without this stemming process, the end result may not satisfy the search or meet the requirements of the developed system.

The Porter Stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980 [3]. The stemmer is based on the idea that the suffixes in the English language (approximately 1200) consist mainly of combinations of smaller and simpler suffixes. It is a linear step stemmer: it has five steps, with rules applied in each step. Figure 1 shows the steps in the Porter Stemming Algorithm.

[Figure 1: flowchart. Step 1: remove and recode plurals; remove 'ed' or 'ing' if found and, if that rule fired, recode the remaining stem; recode 'y' to 'i' if another vowel is present in the stem. Step 2: index on the penultimate letter of the stem; if the stem contains a double suffix, map it to a single suffix. Step 3: index on the final letter of the stem; if an ending matches, remove the ending. Step 4: index on the penultimate letter of the stem; if an ending matches and the stem satisfies '<c>vcvc<v>', remove the ending. Step 5: remove a final 'e' only if more than one consonant sequence is present in the stem; output the stem.]

Figure 1. Porter Stemming Algorithm [4]
© ICCIT 2012 253
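The conditions in Figure 1 that refer to consonant sequences depend on the measure m defined in Porter's original description [3]: writing a stem as [C](VC)^m[V], m counts the vowel-consonant sequences. The Python sketch below computes m and applies a single guarded suffix rule. It is an illustration only, not the Java implementation used in this study; the function names `measure` and `apply_rule` are hypothetical, input is assumed to be lower case, and only one rule is tried at a time rather than the full five-step rule set.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Porter's definition: any letter other than a, e, i, o, u,
    and other than 'y' preceded by a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a consonant at the start of the word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-consonant sequences, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1  # a run of vowels has just been closed by a consonant
        prev_is_vowel = not cons
    return m


def apply_rule(word, suffix, replacement, min_measure):
    """Apply '(m > min_measure) SUFFIX -> REPLACEMENT' to one word.
    The condition is tested on the stem that would remain after removal."""
    if not word.endswith(suffix):
        return word
    stem = word[: len(word) - len(suffix)]
    if measure(stem) > min_measure:
        return stem + replacement
    return word


print(measure("tree"), measure("trouble"), measure("private"))  # 0 1 2
print(apply_rule("relational", "ational", "ate", 0))  # relate
print(apply_rule("rational", "ational", "ate", 0))    # rational (stem "r" has m = 0)
print(apply_rule("adjustment", "ment", "", 1))        # adjust
```

Testing the condition on the candidate stem rather than on the whole word is what keeps short words such as "rational" from being reduced.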


Within each step, if a suffix rule matches the word, the conditions attached to that rule are tested on what would be the resulting stem if that suffix were removed in the way defined by the rule. An example of such a condition is that the number of vowel-consonant sequences in the stem must be greater than one for the rule to be applied.

If any suffix rule matches the word, the conditions attached to it are tested and the stem is obtained by removing the suffix. This process continues until either a rule from that step fires and control passes to the next step, or there are no more rules in that step, in which case control also moves to the next step. The resulting stem is returned by the stemmer after control has passed from step five.

The Porter Stemmer has been used in various studies. It is the most widely used stemmer in Information Retrieval research [3]. Implementations of this stemmer are also available at a website established by Porter himself, with implementations in Java, C, Perl and many more [5]. In other research, the Porter Stemming Algorithm has been used to classify emails into spam and non-spam more efficiently [6].

II. SEMANTIC CHECKING BACKGROUND

The Unified Modeling Language (UML) was adopted as a standard by the Object Management Group (OMG) in November 1997. It is mainly used to probe and obtain user needs and to design object-oriented software systems [7]. In modeling, designing the UML class diagram is an important phase. Nevertheless, the UML lacks formal semantics: for example, the meaning of the elements of a UML model is not formally defined and may depend on the interpretation of the individuals who are using the UML [8]. One needs to know the quality of a UML model under development in order to improve the modeling phase. Unhelkar (2003) lists the following levels of quality in practical UML-based projects [9]:
1) Data quality: the accuracy and reliability of the data, resulting in quality work ensuring the integrity of the data.
2) Code quality: the correctness of the programs and their underlying algorithms.
3) Model quality: the correctness and completeness of the software models and their meanings.
4) Architecture quality: the quality of the system in terms of its ability to be deployed in operation.
5) Process quality: the activities, tasks, roles and deliverables employed in developing software.
6) Management quality: planning, budgeting and monitoring, as well as the "soft" or human aspects of a project.
7) Quality environment: all aspects of creating and maintaining the quality of a project, including all the above aspects of quality.

The semantic aspect of model quality ensures not only that the diagrams produced are correct, but also that they faithfully represent the underlying reality of the domain [10]. Therefore, semantic checking should be implemented to ensure that the elements in a UML class diagram are correctly defined. Semantic checks deal with the meaning behind an element or a diagram. Therefore, this check focuses not on the correctness of the representation but on the completeness of the meaning behind the notation [11]. In evaluating problems in modeling, research by Thomasson shows the difficulties in designing an appropriate UML class diagram [12]. They are:

• The variation of the design form.
• Naming the notation element.
• Free in designing.
• Difficult to state the class or object.
• Difficult to elaborate the requirement.

The UML class diagrams designed by students tend to neglect modeling quality such as correctness and completeness. These problems must be overcome to ensure that there are no duplicate class names and that the inheritance relationships are valid. Therefore, a framework will be developed to overcome the inconsistency problem in UML class diagrams [13].

III. PORTER STEMMER FOR SEMANTIC CHECKING

There are many problems when trying to conflate English words. Among them is the presence of strong (irregular) verbs, which do not follow the regular pattern and change their stem when forming tenses, such as "throwing", "threw" and "thrown". This complexity leads to some errors in stemming, in which unrelated words are conflated together and unrelated terms are matched. However, the methods proposed to address this issue are very complex. This research applied the Porter Stemming algorithm for getting the root word after

extracting synsets from RiTa.WordNet. This algorithm was chosen for two reasons. First, it provides a simple approach to conflation that seems to work well in practice and that is applicable to a variety of languages. Second, it has prompted interest in stemming as a topic for research in its own right, rather than solely as a low-level component of an information retrieval system [14]. The Porter Stemming algorithm became the most popular and standard approach for stemming because it is very concise and very readable for a programmer. Figure 2 shows the part of the framework in our study that applies the Porter Stemming Algorithm in Phase 2 of the Framework to Analyze Semantic of Object-Oriented Model (FASOOM).

[Figure 2: flow diagram. Students' Answers → Tokenization → Synsets extraction using RiTa.WordNet (reading from WordNet) → Stemming process → KB]

Figure 2. Synsets Extraction Process in FASOOM Framework.

The synsets extraction process is a part of the FASOOM framework that has been discussed in detail in [15]. This phase describes the steps in the synsets extraction process, which includes the stemming process.

A. Students' Answers

Students' answers are UML class diagrams that have been extracted and stored in a file. The properties of the UML class diagram that are included in the file are the UML class name, attributes, operator, parameter and type.
1) UML class name: The name of the class.
2) Attributes: Instances of a property that is owned by the class. As an example, the class Transaction can have attributes such as date and time.
3) Operator: Represents the functions or tasks that can be performed on the data in the class. As an example, the class Account can have operators like withdraw and deposit.
4) Parameter: A parameter specifies a type of argument and the value it takes in the call to an operation. An operation can have any number of parameters or none at all.
5) Type: A data type is a classifier whose instances are identified only by their value, for example date, time, string, integer and many more.

B. Tokenization

A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. The process of breaking a text up into its constituent tokens is known as tokenization. Standard methods in tokenization are [16]:

• Separate on whitespace
• Alphabetic strings
• Alphanumeric strings

Students' answers stored in a file are ordered by UML class diagram properties. Hence, tokenization is implemented using the separate-on-whitespace method to make sure that only UML class names are extracted from the students' answers stored in the file.

C. WordNet Database

WordNet has been developed as a useful tool which combines a thesaurus and a dictionary [17]. It is a large lexical database of English. Nouns, verbs, adjectives and adverbs are put together into sets of cognitive synonyms (synsets), each expressing a distinct concept. WordNet stores synsets with their definitions and relations. A list of synsets from WordNet is extracted after a search word is entered. These synsets can be used to compare the synonyms of a UML class name.

D. Extraction using RiTa.WordNet

The getAllSynsets() method provided by RiTa.WordNet is used for extracting synsets from WordNet. This method returns an array of strings containing the words in each synset for all senses of the word with the given part of speech (nouns, verbs, adjectives and adverbs), or null if none is found.

E. Stemming Process

Some synsets that were extracted from WordNet contain suffixes. Hence, a stemming technique is used to produce the roots of the resulting synsets. The Porter Stemming Algorithm was chosen to perform this technique. Details of the results from the stemming process are discussed in the next section.
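The flow from students' answers through tokenization, synset lookup and stemming can be sketched end to end in Python. This is a minimal sketch under stated assumptions, not the FASOOM implementation: the file layout (one class per line, class name as the first whitespace-separated token) is assumed, `lookup_synsets` is a hypothetical stand-in for the RiTa.WordNet call that reads from a small hand-made dictionary, and `toy_stem` is a placeholder that strips only a trailing "ing" rather than running the Porter algorithm.

```python
def tokenize_class_names(lines):
    """Separate each line on whitespace and keep the first token,
    assumed here to be the UML class name."""
    names = []
    for line in lines:
        tokens = line.split()
        if tokens:  # skip blank lines
            names.append(tokens[0])
    return names


def lookup_synsets(word, lexicon):
    """Hypothetical stand-in for a WordNet lookup: returns a list of
    words, or None when the word is not found."""
    return lexicon.get(word.lower())


def toy_stem(word):
    """Placeholder for the Porter stemmer: removes a trailing 'ing' only."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word


def extract(lines, lexicon):
    """Tokenize, look up synsets, stem them, and collect the result
    in a dictionary that plays the role of the knowledge base."""
    kb = {}
    for name in tokenize_class_names(lines):
        synsets = lookup_synsets(name, lexicon)
        if synsets is None:
            continue
        kb[name] = sorted({toy_stem(w) for w in synsets})
    return kb


# Illustrative input only; the lexicon entry is hand-made for this example.
answers = ["Account balance withdraw deposit", "Transaction date time"]
lexicon = {"account": ["accounting", "account"]}
print(extract(answers, lexicon))  # {'Account': ['account']}
```

Because the stemmed forms are collected in a set, "accounting" and "account" collapse into a single entry, which is the narrowing effect that stemming is meant to provide.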

F. Knowledge Base (KB)

The knowledge base stores the end results of the synsets extraction process. After going through the tokenization, extraction and stemming steps, the end results are the synsets that were extracted and have been stemmed.

IV. EXPERIMENTAL SETTING

In this experiment, we used Java as the programming language to run the Porter Stemmer Algorithm. The experimental design is based on Figure 3 below.

[Figure 3: flow diagram. Phase I: Input → Phase II: Stemming Process → Phase III: Database → Phase IV: Result Analysis]

Figure 3. Experimental design

Based on Figure 3, we have worked on four phases. They are input, stemming process, database and result analysis. The details of the experimental design are discussed as follows.

A. Phase I: Input

In the input part, the searched word is entered as input. The searched word is a word that contains a suffix, which will then be removed in the next phase.

B. Phase II: Stemming Process

In the processing part, processing is done based on the input entered. The Porter Stemmer Algorithm is applied in this part. This is where all synsets of the searched word are stemmed. The stemming process applies five steps as described below.

1) Step 1
This step is designed to deal with past participles and plurals. This is the most complex step and is separated into three parts in the original definition: 1a, 1b and 1c. Step 1 also removes inflectional suffixes (i-suffixes).

a) Step 1a
    SSES -> SS
    caresses -> caress

b) Step 1b
    (*v*) ING ->
    opening -> open

c) Step 1c
    (*v*) Y -> I
    history -> histori

2) Step 2
This step is much more straightforward. It deals with pattern matching on some common suffixes. It removes derivational suffixes (d-suffixes) and follows rules such as:

    (m>0) ATIONAL -> ATE
    relational -> relate

3) Step 3
Step 3 deals with special word endings. It also removes derivational suffixes (d-suffixes). Composite d-suffixes are reduced to single d-suffixes one at a time. Therefore, if a word ends with -icational, Step 2 reduces it to -icate and Step 3 reduces it to -ic. Below is an example of the rules applied in Step 3.

    (m>0) NESS ->
    possibleness -> possible

4) Step 4
Step 4 checks the stripped word against more suffixes in case the word is compounded. It deals with -ic, -able, -ive and many more, using a strategy similar to Step 3. An example of the rules involved in this step is shown below:

    (m>1) MENT ->
    adjustment -> adjust

5) Step 5
Step 5 tidies up after the suffixes have been removed in the previous steps. It checks whether the stripped word ends in a vowel and fixes it appropriately. It consists of Step 5a and Step 5b, as indicated in the examples:

a) Step 5a
    (m>1) E ->
    probate -> probat

b) Step 5b
    (m>1 and *d and *L) -> single letter
    bill -> bil

C. Phase III: Database

The words before and after the stemming process are stored in a database named "synonym". These words are stored in a table named "stem", which contains the unique field "queryID", "query", which stores the words before stemming, "stemQuery", which stores the words after stemming, and "synsetid", which stores the foreign key for the synsets id from the table "synsets".
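The table layout described in Phase III can be sketched with Python's built-in sqlite3 module. The paper does not name the database engine or the column types, so the use of SQLite, the column types, the single-column layout of the "synsets" table and the synset id value 1 are all assumptions made for illustration; only the table and field names come from the description above.

```python
import sqlite3

# In-memory stand-in for the "synonym" database (engine is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the synsetid reference

conn.executescript(
    """
    CREATE TABLE synsets (
        synsetid INTEGER PRIMARY KEY          -- other columns are not described
    );
    CREATE TABLE stem (
        queryID   INTEGER PRIMARY KEY,         -- unique field
        query     TEXT NOT NULL,               -- word before stemming
        stemQuery TEXT NOT NULL,               -- word after stemming
        synsetid  INTEGER NOT NULL REFERENCES synsets(synsetid)
    );
    """
)

# Placeholder synset id; the word pairs are taken from Table I.
conn.execute("INSERT INTO synsets (synsetid) VALUES (1)")
conn.executemany(
    "INSERT INTO stem (query, stemQuery, synsetid) VALUES (?, ?, ?)",
    [("accounting", "account", 1), ("opening", "open", 1)],
)
conn.commit()

rows = conn.execute(
    "SELECT query, stemQuery FROM stem ORDER BY queryID"
).fetchall()
print(rows)  # [('accounting', 'account'), ('opening', 'open')]
```

Keeping both the original and the stemmed form in the same row makes it straightforward to compare them later, as the result analysis phase requires.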

D. Phase IV: Result Analysis

The result is the set of words obtained after the stemming process. Some of the words that have been stemmed contain errors. This phase analyzes the result to determine which words are stemmed to their root and which are not.

V. RESULTS AND DISCUSSION

A. Stemming Results

Some synsets have been extracted from WordNet to test the stemming algorithm. Examples of synsets before and after applying the Porter Stemming Algorithm are shown in Table I.

TABLE I. EXAMPLE OF SYNSETS THAT HAVE BEEN STEMMED

Before stemming     After stemming
extraction          extract
accounting          account
partitioning        partition
possibleness        possible
categorisation      categorization
adjustment          adjust
opening             open
segmentation        segment
reflection          reflect
sorting             sort

As shown in Table I, several synsets which have been tested show the successful application of the Porter Stemming Algorithm for stemming the synsets into their corresponding roots or stems. This application is useful for narrowing down the search results when some synsets are variants of the search word. For example, when we search for the word account, one of its synsets is accounting. After the stemming process, the word accounting is transformed into the word account, which is the same as the search word. Hence, the word accounting will not be listed in the synsets that have been searched.

B. Stemming Errors

Evaluating stemming algorithms reveals two common problems for word standardization: under-stemming errors and over-stemming errors [18]. The synsets extraction process will fail if a word with stemming errors becomes the search word, because the word has no meaning. In this research, some of the extracted synsets become meaningless after the stemming process is applied because of over-stemming errors. Over-stemming happens when too long a suffix is removed from the word. Two or more words with separate meanings may get the same stem, which leads to loss of precision [19]. Examples of synsets with over-stemming errors are shown in Table II.

TABLE II. EXAMPLE OF SYNSETS WITH OVER-STEMMING ERRORS

Before stemming     After stemming
bill                bil
determination       determin
explanation         explan
conclusion          conclus
decisiveness        decis
consistence         consist
history             histori
division            divis
declination         declin
section             sect

This problem can be addressed by excluding the suffixes ATION, ATOR, EMENT, MENT, ENT, ION and E from the set of suffixes to be removed. We choose only certain suffixes to be removed because of the errors detected when synsets are extracted.

C. Performance

The accuracy of this algorithm is tested by calculating the mean number of words (MWC), obtained by dividing the number of words entered by the number of accurate words after stemming.

$$\mathrm{MWC} = \frac{W}{A} \qquad (1)$$

where

W = Number of words entered
A = Accurate words after stemming

To test the accuracy of the Porter Stemming algorithm, we entered 50000 words as input, and the number of accurate words after stemming was 11320. By substituting these values in (1), we get:

$$\mathrm{MWC} = \frac{50000}{11320} \approx 4.42$$

VI. CONCLUSION

In this study, the stemming algorithm serves as an added value to enhance the search results. The use of stemming algorithms can give the root word of the synsets that have been retrieved from WordNet. This will increase the precision of the synsets before we search the synsets in depth.

Future studies will attempt to resolve problems related to errors in stemming as shown in the results. Although the extraction of some words can be resolved with this algorithm, there are some words that take on a different meaning after the stemming process, such as the word opening, which is stemmed to open. The synsets list generated when opening becomes the search word is different from the synsets list for the word open.

ACKNOWLEDGMENT

This research was supported by a grant from Tabung Bantuan Pendidikan Khas (TBPK), Universiti Malaysia Terengganu (Vot: 53057) and the National Science Fellowship (NSF) under the Ministry of Science, Technology and Innovation (MOSTI) Malaysia. We would also like to thank the reviewers of ICCIT 2012 for their comments, suggestions, and bibliographical indications.

REFERENCES

[1] Grigore, M., Introduction to Stemming. 2008.
[2] Frakes, W.B., et al., "Strength and similarity of affix removal stemming algorithms," SIGIR Forum, vol. 37, pp. 26-30. 2003.
[3] Porter, M.F., "An algorithm for suffix stripping," Program, vol. 14, pp. 130-137. 1980.
[4] Hooper, R., et al. The Lancaster Stemming Algorithm 2005; Available from: http://www.comp.lancs.ac.uk/computing/research/stemming/Files/porter.JPG.
[5] Porter, M. The Porter Stemming Algorithm. 2006 [cited 14 November 2011]; Available from: http://tartarus.org/~martin/PorterStemmer/index.html.
[6] Basavaraju, M., et al., "A Novel Method of Spam Mail Detection using Text Based Clustering Approach," International Journal of Computer Applications, vol. 5, pp. 15-25. 2010.
[7] Booch, G., "UML in action," Communication of ACM, vol. 42, pp. 26-28. 1999.
[8] Christian, F.J.L. "Improving the quality of UML models in practice," in Proceedings of the 28th international conference on Software engineering, Shanghai, China: ACM, 2006.
[9] Unhelkar, B., Process Quality Assurance for UML-Based Projects. Boston: Addison-Wesley Professional, 2003.
[10] Warmer, J., et al., The Object Constraint Language: Precise Modeling With UML (Addison-Wesley Object Technology Series): Addison-Wesley Professional, 1998.
[11] Unhelkar, B., Verification and Validation for Quality of UML 2.0 Models. 2005, John Wiley & Sons, Inc.: Hoboken, New Jersey.
[12] Thomasson, B., et al. "Identifying novice difficulties in object oriented design," in Proceedings of the 11th annual SIGCSE conference on Innovation and technology in computer science education, Bologna, Italy: ACM, 2006.
[13] Noor Maizura Mohamad Noor, et al. "A New Framework to Extract WordNet Lexicographer Files for Semi-Formal Notation: A Preliminary Study," in 4th International Symposium on Information Technology 2010 (ITSim'10), vol. 2, Kuala Lumpur, Malaysia: Institute of Electrical and Electronics Engineers Inc., 2010, pp. 1027-1031.
[14] Willett, P., "The Porter stemming algorithm: then and now," Program: electronic library and information systems, vol. 40, pp. 219-223. 2006.
[15] Ali, N.H., et al., "Analyze Semantic of Object-Oriented Model Using RiTa.WordNet," Journal of Computing, vol. 3, pp. 61-66. 2011.
[16] Rennie, J., Text Classification. 2003, p. 1-47.
[17] Miller, G.A., "WordNet: A Lexical Database for English," Communication of The ACM, vol. 38, pp. 39-41. 1995.
[18] Paice, C.D. "An evaluation method for stemming algorithms," in Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, Dublin, Ireland: Springer-Verlag New York, Inc., 1994, pp. 42-50.
[19] Airio, E., "Word normalization and decompounding in mono- and bilingual IR," Information Retrieval, vol. 9, pp. 249-271. 2006.
