
A Novel Arabic Lemmatization Algorithm

Eiman Al-Shammari
George Mason University
Kuwait University
4400 University Drive, MS4A5
Fairfax, VA 22030
[email protected]

Jessica Lin
George Mason University
4400 University Drive, MS4A5
Fairfax, VA 22030
[email protected]

ABSTRACT
Tokenization is a fundamental step in processing textual data preceding the tasks of information retrieval, text mining, and natural language processing. Tokenization is a language-dependent approach, including normalization, stop words removal, lemmatization and stemming.
Both stemming and lemmatization share a common goal of reducing a word to its base. However, lemmatization is more robust than stemming as it often involves usage of vocabulary and morphological analysis, as opposed to simply removing the suffix of the word. In this work, we introduce a novel lemmatization algorithm for the Arabic language.
The new lemmatizer proposed here is a part of a comprehensive Arabic tokenization system, with a stop words list exceeding 2,200 Arabic words. Currently, there are two leading Arabic stemmers: the root-based stemmer and the light stemmer. We hypothesize that lemmatization would be more effective than stemming in mining Arabic text. We investigate the impact of our new lemmatizer on unsupervised data mining techniques in comparison to the leading Arabic stemmers. We conclude that lemmatization is a better word normalization method than stemming for Arabic text.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – Indexing methods, Linguistic processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Clustering.

General Terms
Algorithms, Documentation, Experimentation, Standardization, Languages.

Keywords
Text Mining, Arabic, Stemming, Lemmatization, Tokenization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
AND’08, July 24, 2008, Singapore.
Copyright © 2008 ACM 978-1-60558-196-5... $5.00

1. INTRODUCTION

The Internet is witnessing an explosive growth in the field of information search and retrieval. Unfortunately, due to language differences, this growth is limited to the language in which it was developed (usually English) and cannot be easily transferred to different linguistic environments. Some languages share similar structures, whereas others are totally different. In such cases, text processing algorithms developed for a specific language cannot be applied to other languages.

Arabic is the sixth most widely spoken language in the world. According to the Global Research’s (2004) estimate, there are 10.5 million Arabic speakers with access to the Internet, compared to 287.5 million English speakers. Unfortunately, efforts to improve Arabic information search and retrieval are limited and modest compared to other languages. The barrier to text processing advancements in Arabic is the very complicated morphological structure of the Arabic language.

Stemming is a computational process for reducing words to their root (or stem) [1], and it can be viewed as a recall-enhancing device or a precision-enhancing device. As a result, stemmers are basic elements in query systems, indexing, web search engines and information retrieval systems (IRS).

The current Arabic stemming approaches focus only on the morphological structure. Ignoring basic Arabic rules can cause errors in automatic translation, text clustering, text summarization, and NLP. Currently, there are two leading Arabic stemmers: the root-based stemmer and the light stemmer.
The structure of Arabic makes it harder to stem words to their roots. Common stemming errors include over-stemming, under-stemming, and mis-stemming.

This paper presents a new stemming algorithm that relies on Arabic language morphology and Arabic language syntax. The addition of the syntactical knowledge creates what is known as a lemmatizer in linguistics. The automated addition of syntactic knowledge reduces both stemming errors and stemming cost.
This paper is organized as follows: In Section 2, we briefly review the Arabic language morphology and discuss previous Arabic language tokenization processes. Our methodology is presented in Section 3, followed by the proposed lemmatization algorithm in Section 4. In Section 5, we present the evaluation criteria and the experimental results. Finally, we conclude our study and discuss future work in Section 6.

2. Background and Related Work

Arabic is a Semitic language with a composite morphology. Words are categorized as particles, nouns, or verbs. There are 29 letters in Arabic, and words are formed by linking letters of the alphabet. Figure 1 shows a list of Arabic letters. Unlike most Western languages, Arabic script is written from right to left. The letters are connected, and words do not start with a capital letter as in English. Due to these characteristics of Arabic, one particularly challenging task for machines is to recognize and extract proper nouns from Arabic texts.

ا ب ت ث ج ح خ
د ذ ر ز س ش ص
ض ط ظ ع غ ف ق
ك ل م ن ه و ي

Figure 1: Arabic Particles (letters)

Furthermore, in English, words are formed by attaching prefixes and suffixes to either or both sides of the root. For example, the word Untouchables is formed as follows:

Un        touch    able            s
Prefix    Root     First Suffix    Second Suffix

In Arabic, additions to the root can be within the root (not only on the word sides), which is called a pattern. This causes a serious issue in stemming Arabic documents because it is hard to differentiate between root particles and affix particles.

Table 1 displays an example of the Arabic word الشارب (drinker) and its stems with the common prefixes and suffixes.

Table 1. Arabic Example

Prefixes + Stem (Root + Pattern) + Suffixes

Root        شرب                  drink
Prefixes    ال                   the
Stem        شارب                 drinker
Suffixes    ين, ان               dual
Suffixes    ون                   plural
Suffixes    ة                    feminine
            الشاربين الشاربان    the drinkers (dual)
            الشاربين             the drinkers (plural)
            الشارب               the drinker (masculine)
            الشاربة              the drinker (feminine)

Due to the Arabic morphological structure, Arabic requires a different stemming process from other languages. Stemming of Arabic documents was done manually prior to TREC (Text Retrieval Conference) and was only applied to small corpora. As mentioned, the most common Arabic stemming approaches are the root-based and the light stemmers.

Automatic Arabic stemming proved to be an effective technique for text processing for small collections [2], [3] and large collections [4], [5] of documents. Xu et al. [6] showed that spelling normalization combined with the use of tri-grams and stemming could significantly improve Arabic text processing by 40%.

The two most effective Arabic stemmers are Larkey’s light stemmer [4], [5] and Khoja’s root-extraction stemmer [7]. In addition, Duwairi [8], El Kourdi et al. [9] and Mustafa et al. [10] found that the N-gram stemming technique is not efficient for Arabic text processing. In summary, Arabic stemming produced promising results in some applications and failed in others.

Over-stemming and under-stemming are the main drawbacks of the root-based stemming and the light stemming algorithms, respectively. Over-stemming, under-stemming and mis-stemming are all stemming errors that usually degrade the correctness of stemming algorithms [11].
As stated in [12], mis-stemming is defined as “taking off what looks like an ending, but is really part of the stem,” and over-stemming is “taking off a true ending which results in the conflation of words of different meanings”.
Arabic stemmers blindly stem all the words and perform poorly, especially with compound words, proper nouns and foreign Arabized words. The main cause of this problem is the stemmer’s lack of knowledge of the word’s lexical category (i.e., noun, verb, preposition, etc.).

A possible solution for this problem is to add a lookup dictionary to check the roots. Although this solution seems straightforward and easy, the process is computationally expensive. Al-Fedaghi and Al-Anzi [13] estimated that there are around 10,000 independent roots. Each root word can have prefixes, suffixes, infixes, and regular and irregular tenses.

Another solution is to define a rule to stem words instead of chopping off the letters blindly; this rule is set by the syntactical structure of the word. For example, verbs require aggressive stemming and need to be represented by their roots. Nouns, on the contrary, only require light elimination of suffixes and prefixes. This advanced stemming is known as lemmatization [14].

Lemmatization is a normalization technique [5], generally defined as “the transformation of all inflected word forms contained in a text to their dictionary look-up form” [15].

To the best of our knowledge, there has been no proposed algorithm for Arabic lemmatization.
In this work, we propose the first Arabic lemmatization algorithm, and we hypothesize that lemmatization will be more efficient in tokenizing Arabic documents than stemming. In addition to the general stemming benefits, lemmatization can overcome the stemming errors and reduce stemming cost by reducing unnecessary stemming.

3. Methodology

Tokenization often performs stop words removal early in the process, although there is currently no standardized list of Arabic stop words. The currently available list [4] contains fewer than 200 words. Tables 2, 3, 4, and 5 show a subset of these words.
We are able to define more than 2,200 stop words and categorize them into useful and useless stop words. Useless stop words are stop words that are used extensively and give no benefits to the subsequent words. Tables 3 and 5 are examples of useless stop words. Useful stop words are words that can indicate the syntactical categories of the subsequent words. For example, in an English sentence such as “I went to school yesterday,” it is easy to realize that school is a noun and thus does not require aggressive stemming.
Unfortunately, due to the early removal of the stop words, this valuable information is lost. The same scenario applies to Arabic too. We believe that the useful stop words can help us identify nouns and verbs and direct us to the appropriate stemming. Our algorithm can also be considered an advanced stemmer, in which identified nouns and verbs are used to generate global noun and verb dictionaries. The benefit of these dictionaries is to find similar nouns in the corpus that were used differently in other sentences. For example, in the following text the word school is identified as a noun in the first sentence and is therefore recognized as a noun in the following sentence.
I went to school yesterday, I love school.
In Table 2, we show a sublist of stop words preceding verbs, and Table 3 presents some of the stop words preceding nouns. Our stop words list was initially generated by three methods: English stop words translation, identification of common words in arbitrary Arabic documents, and manual search of synonyms to the previously identified stop words.

In the following section we will describe our algorithm in detail.

Table 2. Preposition Preceding Verbs

Preposition    English
حيثما          Wherever
كلّما          Whenever
إذا            If
عندما          When (not for question)

Table 3. Arabic circumstantial nouns indicating time and place

Preposition    English Equivalence
بعد            After
على            Over
فوق            Above, up
إلَى           until, near, towards, to
أمامَ          in front of
باتجاه         In the direction of
بجانب          Aside, next to, beside
تحت            Below, beneath, down
حتى            Till (time and location)
خارج           Outside of
خلال           Through, during
عبر            Through
عن             From, about
في             In (time, location, duration)
قبلَ           Before
قريب           Near
منذ            since
وراء           Behind, Beyond
بين            Between

Table 4. Arabic Independent Pronouns

Word           English Equivalence
نحن            Us
أنا            I am
أنت/أنتم       You (feminine/Masculine) Plural
أنتن           You (feminine/Masculine) Singular
هي/هو          She/he
هن/هم          Them (feminine/Masculine)
هما            Them (dual)

Table 5. Arabic Demonstrative prepositions

Preposition    English Equivalence
هذا            This: used for masculine
هذه            This: used for feminine
ذلك            That: used for masculine
تلك            That: used for feminine
أولئك          Those
هؤلاء          These

4. Arabic Lemmatization Algorithm

As shown in Figure 2, our novel algorithm consists of different phases. During the first phase, useless stop words are removed to reduce the size of the corpus. Next, we identify nouns by locating either words that follow noun-preceding stop words or words starting with definite articles. These nouns are lightly stemmed by removing suffixes and prefixes and then added to the global noun dictionary [16]. At this level, these words are flagged as nouns in preparation for the stemming phase. In parallel with that process, we find verbs by locating words that follow verb-preceding stop words. Similar to the nouns, the verbs are added to the global verb dictionary and tagged as verbs.
In Arabic, we cannot have two consecutive verbs; thus any word following a verb is either a stop word or a noun. If the word is not a stop word, then the word is added to the noun dictionary and flagged as a noun.

Before we direct a word to the appropriate stemming by the word flag, all the stop words are removed since they offer no further advantage. Other words that do not belong in any category will be treated as nouns and stemmed lightly.
Table 6 below summarizes the algorithm.

Table 6. Arabic Lemmatization Algorithm
Input: Arabic document
Output: Stemmed document, Noun Dictionary, Verbs Dictionary.

V: Verb dictionary (one-dimensional array sorted alphabetically; for fast lookup, these dictionaries can be implemented using hash tables)
N: Noun dictionary (one-dimensional array sorted alphabetically)
NSW: Array of stop words preceding nouns
VSW: Array of stop words preceding verbs
SW: Array of stop words (including both NSW and VSW)

Phase Zero: Remove useless stop words.
Phase One: Simple Noun Identification
Locate words attached to definite articles, and words preceded by NSW, and flag them as nouns.
Phase Two: Suffix and Prefix Removal
Apply the suffix and prefix removal approach to the entire document.
Longest suffixes and prefixes are removed first.
Phase Three: Noun Dictionary Generation
Add the identified processed words to N.
Phase Four: Verb Identification
Verbs are preceded by VSW.
Phase Five: Verb Dictionary Generation
Add the identified processed words to V.
Phase Six: Find all noun tokens.
Phase Seven: Stop Word Removal
Remove useful and useless stop words.
Phase Eight: Root Extraction for Verbs
Roots are extracted by comparing verbs to Arabic root patterns; words (tokens) with missing tags are considered nouns and lightly stemmed.
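
The phases in Table 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word sets and affix lists are tiny hypothetical samples in the spirit of Tables 1, 2, 3 and 5; the names `light_stem`, `lemmatize` and the `min_len` guard are invented for the example; giving a VSW cue precedence over the definite article is an assumption; and Phase Eight is left as a pluggable `extract_root` callback that defaults to the identity. Phase Six is implicit here because untagged tokens are already treated as nouns. Python sets stand in for the dictionaries N and V, in line with the hash-table remark in Table 6.

```python
# Toy cue lists in the spirit of Tables 2, 3 and 5 (hypothetical samples).
USELESS_SW = {"هذا", "هذه"}           # dropped in Phase Zero
NSW = {"في", "على", "إلى", "بعد"}     # stop words that precede nouns
VSW = {"عندما", "إذا"}                # stop words that precede verbs
SW = NSW | VSW
PREFIXES = ["ال"]                     # definite article (Table 1)
SUFFIXES = ["ين", "ان", "ون", "ة"]    # dual, plural, feminine (Table 1)


def light_stem(word, min_len=3):
    """Phase Two: strip one prefix and one suffix, longest affix first."""
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word


def lemmatize(tokens, extract_root=lambda verb: verb):
    """Return (normalized tokens, noun dictionary N, verb dictionary V)."""
    # Phase Zero: remove useless stop words.
    tokens = [t for t in tokens if t not in USELESS_SW]

    # Phases One and Four: flag tokens from the cues around them.
    tags = {}
    for i, tok in enumerate(tokens):
        if tok in SW:
            continue
        prev = tokens[i - 1] if i > 0 else None
        if prev in VSW:
            tags[i] = "V"
        elif prev in NSW or tok.startswith("ال") or tags.get(i - 1) == "V":
            # The last test encodes "no two consecutive verbs".
            tags[i] = "N"

    nouns, verbs, out = set(), set(), []
    for i, tok in enumerate(tokens):
        if tok in SW:
            continue                        # Phase Seven: drop stop words
        stem = light_stem(tok)              # Phase Two
        if tags.get(i) == "V":
            verbs.add(stem)                 # Phase Five
            out.append(extract_root(stem))  # Phase Eight (placeholder)
        else:
            nouns.add(stem)                 # Phase Three; untagged tokens count as nouns
            out.append(stem)
    return out, nouns, verbs


tokens = "عندما ذهب هذا الولد إلى المدرسة".split()
out, nouns, verbs = lemmatize(tokens)
```

On this toy sentence the word ذهب is flagged as a verb because it follows عندما, while الولد and المدرسة are flagged as nouns and lightly stemmed to ولد and مدرس. The second stem shows how even light suffix removal can change a word, which is why the sketch keeps a minimum stem length.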

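Phase Eight compares verbs to Arabic root patterns, but the paper does not list the patterns it uses. The sketch below is therefore only an assumption about how such a comparison could work: patterns are written with the conventional placeholder letters ف, ع and ل for the three root consonants, every other pattern letter must match the word literally, and the first pattern of matching length wins. The names `PATTERNS`, `ROOT_SLOTS` and `extract_root` and the short pattern list are hypothetical.

```python
# Hypothetical pattern list, longest first. The letters ف, ع, ل mark the
# three root slots; any other letter must appear literally in the word.
PATTERNS = ["استفعل", "انفعل", "افتعل", "تفاعل", "فاعل", "يفعل", "فعل"]
ROOT_SLOTS = "فعل"


def extract_root(word, patterns=PATTERNS):
    """Return the slot letters of the first matching pattern, else the word."""
    for pattern in patterns:
        if len(pattern) != len(word):
            continue
        # A position matches if it is a root slot or the letters are equal.
        if all(p in ROOT_SLOTS or p == w for p, w in zip(pattern, word)):
            return "".join(w for p, w in zip(pattern, word) if p in ROOT_SLOTS)
    return word  # no pattern matched: leave the token unchanged


roots = [extract_root(w) for w in ["استخدم", "يكتب", "شارب"]]
```

A first-match rule like this can still mis-stem words whose root letters coincide with pattern letters, so a fuller implementation would need a root dictionary or extra checks to confirm each candidate.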
5. Evaluation and Experiments

We compare our Arabic lemmatization algorithm to the leading Arabic root-based stemmer presented by Khoja. We observe our algorithm's effect on document clustering performance in comparison to the Khoja stemmer.

5.1 Experiment setting

5.1.1 Data Description
In our experiments, we use modern, unedited, and unmarked Arabic text: a sample of approximately 7000 documents from various Arabic online resources, used to construct three datasets. The first dataset contains economics articles drawn randomly from Al-Watan (2008), a newspaper from Kuwait. The second dataset is extracted from Arabic medical websites. This dataset contains two subsets: kidney-failure-related articles and physiology-related documents. The last dataset contains randomly selected documents retrieved from ACE2004 (ACE 2004 Multilingual Training Corpus, Linguistic Data Consortium).

Figure 2. The Lemmatization Algorithm Simplified

5.1.2 Clustering
Cluster analysis [17] is the process of dividing objects into groups of similar objects according to a distance measure. Clustering is applied in many fields including text mining and machine learning. K-means is a widely used partitional clustering method with linear time complexity [18].
To design our experiments, we chose to study our algorithm’s effect on the performance of K-means clustering in contrast to the Khoja stemmer’s effect. We used the TF-IDF weighting function and created three different experiment sets. The first and the second experiments cluster documents belonging to two different groups. The last experiment studies the effect on clustering three different groups. More details are included in the results subsection.

5.1.3 Performance Measure
To measure the quality of clusters we use cluster purity [19], which is the percentage of documents correctly labeled. The overall purity is the weighted sum of the individual cluster purities. The algorithm that leads to better clustering is the one generating a higher overall purity.

5.2 Results
We perform three different experiments designed to compare the effects of lemmatization and stemming on the purity
of K-means clustering. In the first experiment, our goal is to study the effect on highly related documents, so we chose the two medical subsets. Both the Khoja stemmer and our lemmatizer achieved the same clustering purity. In the second experiment, we chose two contrasting datasets: Medical and News. The clustering purity for the lemmatized documents was 10% higher than the clustering purity for the stemmed documents.

This result was expected since the Khoja stemmer tends to over-stem words, which creates similarities between unrelated documents that contain the same roots for different words.
Finally, for our last experiment, we chose three contrasting datasets: News, Economics, and Medical.

Applying K-means clustering on the three datasets leads to an overall cluster purity of 70.8% for the lemmatized documents and 58% for the stemmed documents.
We examine the characteristics of each cluster and notice that medical and economics documents are mis-clustered due to the existence of many shared words like high and low (e.g., high temperatures and high stock prices).

6. Conclusion and Future Work

In this paper we introduce the first Arabic lemmatization algorithm and compare its performance with the Khoja stemming algorithm for clustering applications.
Additionally, we introduce a new framework to normalize Arabic documents by overcoming the limitations of previous approaches, caused by the early removal of stop words. We show that the neglected Arabic stop words can be highly important and can provide a significant improvement to processing Arabic documents. The approach can also reduce stemming errors in English documents due to the prior knowledge of nouns and proper nouns [20].
Stemming error is a subjective measure that does not get much attention in comparing Arabic stemmers. We perform initial experiments on the output of our lemmatizer and on the Khoja stemmer’s output and find that the Khoja stemmer has a high over-stemming error rate.

Our experiments show a promising future for our lemmatizer, which encourages us to conduct further research on clustering a larger number of documents. Currently we are studying lemmatization for Arabic document classification. Also, we would like to study the lemmatizer’s effect on precision/recall, as well as stemming cost, in comparison to the Khoja stemmer and other Arabic stemming algorithms.

7. REFERENCES

[1] W.B. Frakes, “Stemming algorithms,” 1992.

[2] I.A. Al-Kharashi, “Micro-AIRS: A microcomputer-based Arabic information retrieval system comparing words, stems, and roots as index terms,” 1991.

[3] I.A. Al-Kharashi and M.W. Evens, “Comparing Words, Stems, and Roots as Index Terms in an Arabic Information Retrieval System,” Journal of the American Society for Information Science, vol. 45, 1994, pp. 548-60.

[4] L.S. Larkey and M.E. Connell, “Arabic Information Retrieval at UMass in TREC-10,” Proceedings of the Tenth Text REtrieval Conference (TREC-10), E.M. Voorhees and D.K. Harman, eds., 2001, pp. 562-570.

[5] L.S. Larkey, L. Ballesteros, and M.E. Connell, “Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis,” Tampere, Finland: ACM, 2002, pp. 275-282.

[6] J. Xu, A. Fraser, and R. Weischedel, “Empirical studies in strategies for Arabic retrieval,” Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, 2002, pp. 269-274.

[7] S. Khoja and R. Garside, “Stemming Arabic Text,” Lancaster, UK, Computing Department, Lancaster University, 1999.

[8] R. Duwairi, “A Distance-based Classifier for Arabic Text Categorization,” Proceedings of the 2005 International Conference on Data Mining, Las Vegas USA, 2005.

[9] M. El Kourdi, A. Bensaid, and T. Rachidi, “Automatic Arabic Document Categorization Based on the Naïve Bayes Algorithm,” COLING 2004.

[10] S.H. Mustafa and Q.A. Al-Radaideh, “Using N-grams for Arabic text searching,” Journal of the American Society for Information Science and Technology, vol. 55, 2004, pp. 1002-1007.

[11] R.A. Baeza-Yates, “Text-Retrieval: Theory and Practice,” North-Holland Publishing Co., 1992, pp. 465-476.

[12] “Snowball: A language for stemming algorithms”; http://snowball.tartarus.org/texts/introduction.html.

[13] S.S. Al-Fedaghi and F. Al-Anzi, “A New Algorithm to Generate Arabic Root-Pattern Forms,” Proceedings of the 11th National Computer Conference and Exhibition, 1989, pp. 391–400.

[14] T. Korenius et al., “Stemming and lemmatization in the clustering of finnish text documents,” Washington, D.C., USA: ACM, 2004, pp. 625-633.

[15] M. Boot, “Homography and Lemmatization in Dutch Texts,” ALLC Bulletin, vol. 8, 1980, pp. 175-189.

[16] E. Al-Shammari and J. Lin, “Automated Corpora Creation Using A novel Arabic Stemming Algorithm,” The 2008 International Symposium on Using Corpora in Contrastive and Translation Studies (UCCTS), Hangzhou, China: 2008.

[17] A.K. Jain and R.C. Dubes, Algorithms for clustering data, 1988.

[18] M. Steinbach, G. Karypis, and V. Kumar, “A comparison of document clustering techniques,” KDD Workshop on Text Mining, vol. 34, 2000, p. 35.

[19] Y. Zhao and G. Karypis, “Criterion Functions for Document Clustering: Experiments and Analysis,” University of Minnesota, Department of Computer Science/Army HPC Research Center.

[20] E. Al-Shammari, “Towards an Error Free Stemming,” IADIS European Conference on Data Mining (ECDM 2008), Amsterdam, The Netherlands: 2008.
