


International Journal of Computer Applications (0975 – 8887)
Volume 93 – No 6, May 2014

Arabic Text Classification Algorithm using TFIDF and Chi Square Measurements

Aymen Abu-Errub
Assistant Professor
Department of Network and
Information Security, Faculty of IT,
Al-Ahliyya Amman University,
Amman, Jordan

ABSTRACT
Text categorization is the process of classifying documents into a predefined set of categories based on their keyword contents. Text classification is an extended type of text categorization in which the text is further categorized into sub-categories. Many algorithms have been proposed and implemented to solve the problem of English text categorization and classification. However, few studies have been carried out for categorizing and classifying Arabic text. Compared to English, Arabic text classification is considered very challenging due to the Arabic language's complex linguistic structure and its highly derivational nature, in which morphology plays a very important role. This paper proposes a new method for Arabic text classification in which a document is compared with pre-defined document categories based on its contents using the TF.IDF (Term Frequency times Inverse Document Frequency) measure, and the document is then classified into the appropriate sub-category using the Chi Square measure.

General Terms
Information Retrieval.

Keywords
Text Categorization, Text Classification, Term Frequency, Inverse Document Frequency, Chi Square.

1. INTRODUCTION
According to the Internet World Stats (http://www.internetworldstats.com/stats7.htm), Arabic is the fifth most used language in the world, spoken by almost 340 million people in 27 states. The Arabic language is one of the oldest known spoken languages as well as one of the official languages of the United Nations. It belongs to the Semitic language family, originated in the Arabian Peninsula in pre-Islamic times, and spread rapidly across the Middle East [4].

The Arabic language is very interesting in terms of its history, the strategic value of its people and the region they live in, and its cultural legacy. Historically, for more than fifteen centuries, classical Arabic has remained unaffected, comprehensible and functional. At the strategic level, Arabic is the native language of almost 340 million speakers occupying a main region with vast oil reserves important to the world economy. Culturally, the Arabic language is closely associated with Islam, in which 1.4 billion Muslims perform their prayers five times daily [6].

Text categorization (TC) refers to the process of classifying free text into predefined categories by assigning to it one or more category labels. TC has been widely employed in many areas such as email classification, information retrieval and junk email filtering. There are two main approaches to text categorization: the knowledge engineering approach and the supervised learning approach. In the knowledge engineering approach, the classification rules are manually created by domain experts, while in the supervised learning approach, machine learning techniques automatically build the classifiers from a set of labeled documents. Technically, for each input document d and category c, text classification involves two steps: (1) estimating the extent to which d shares semantics with c, and (2) based on the estimation, deciding whether d may be classified into c. In the first step, classifiers need to assess the similarity score of d with respect to c. In the second step, classifiers need to associate a threshold with c. If the similarity score of d with respect to c is higher than or equal to the threshold of c, d is classified into c; otherwise it is not classified into c [8, 10, 16].
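To make this two-step decision concrete, the following is a minimal Python sketch, not taken from the paper: the shared-term similarity function and the threshold value are illustrative placeholders, since no particular scoring scheme is prescribed here.

```python
def classify(doc_terms, category_terms, threshold):
    """Return True if the document is assigned to the category."""
    # Step (1): estimate how much the document shares with the category.
    # A simple shared-term fraction stands in for a real similarity score.
    shared = len(set(doc_terms) & set(category_terms))
    score = shared / max(len(set(doc_terms)), 1)
    # Step (2): assign the document to the category only if the score
    # reaches the threshold associated with that category.
    return score >= threshold

# Illustrative call with made-up Arabic terms and a made-up threshold:
print(classify(["اقتصاد", "سوق", "نفط"], ["اقتصاد", "مال", "سوق"], threshold=0.5))
```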
Nowadays, due to the increasing amount of valuable textual information, it has become a challenge for humans to manually process this huge amount of information and identify the most relevant information or knowledge. Therefore, automatic text categorization plays an important role in helping information users overcome this challenge by reducing the time needed to classify thousands of newly arriving documents each day, without the need for experts. Thus, automatic TC can significantly reduce the cost and effort of manual categorization [3]. For example, it has been reported by the Internet World Stats (http://www.internetworldstats.com/stats7.htm) that the number of Arabic-speaking Internet users grew by 2,501.2% over the eleven years 2000-2011, the highest growth rate among all languages. Consequently, with the increasing amount of Arabic textual information, there is a need to propose, develop and employ Arabic TC algorithms in order to store and divide textual information into categories, thus assisting Arabic users in navigating to the information they would like to obtain.

Compared to other languages, there is still limited research on Arabic text categorization and classification, due to the complex and rich nature of the Arabic language and its highly derivational nature, in which morphology plays a very important role [6, 14]. Additionally, most of this research uses supervised machine learning techniques, most of which have complex mathematical models and do not usually lead to accurate results for Arabic TC [14]. Accordingly, much more research is needed to further develop and refine the area of Arabic TC. In this study, the researcher applies both a vector classification method and the Chi square measurement for Arabic text classification, in which similar documents are grouped into categories and sub-categories based on their contents.


The rest of this paper is organized as follows: Section 2 gives a brief overview of related studies, reviewing a number of research papers that deal with Arabic TC and Arabic root extraction. Section 3 presents the proposed algorithm. Section 4 presents the experimental results. Finally, conclusions and directions for future study are provided in Section 5.

2. RELATED WORKS
Much research has been carried out on text categorization in English. However, research on text categorization for the Arabic language is quite limited [6, 14]. Among the successful approaches for Arabic text categorization, a number of recent studies have been proposed [1, 2, 5, 7, 9, 11-15, 17].

In the paper of Syiam et al. [15], an intelligent Arabic text categorization model is presented. For Arabic text categorization, the proposed model uses: 1) a statistical n-gram stemmer for document pre-processing, 2) a hybrid approach of Document Frequency Thresholding and Information Gain for feature selection, 3) normalized TF-IDF for term weighting, and 4) a Rocchio classifier for classification. Experimental results demonstrate the effectiveness of the proposed model, which gives a generalization accuracy of about 98%.

Mesleh [11] implemented a text classification system for Arabic language articles. The implemented system uses 1) CHI statistics as a feature extraction method in the pre-processing step of the text classification system design procedure, and 2) a Support Vector Machines (SVMs) classification model for TC tasks on Arabic language articles. The author collected a corpus from online Arabic newspaper archives, including Al-Jazeera, Al-Nahar, Al-Hayat, Al-Ahram, and Al-Dostor, in addition to a few other specialized websites. The collected corpus contains 1445 documents falling into 9 classification categories. Experimental results show high classification effectiveness on the Arabic data set in terms of F-measure (F=88.11) compared to other classification methods.

Al-Harbi et al. [2] presented the results of document classification experiments performed on seven different Arabic corpora using a statistical methodology. A tool was implemented for feature extraction and selection, and the performance of two popular classification algorithms (SVM and C5.0) in classifying the seven Arabic corpora was evaluated. Generally, the C5.0 classifier shows better classification accuracy than SVM.

Harrag et al. [9] enhanced Arabic text classification by feature selection based on a hybrid approach of Document Frequency Thresholding with an embedded information gain criterion of the decision tree algorithm. Experiments were conducted over two self-collected corpora. The first corpus is a set of Arabic texts from the Arabian scientific encyclopedia "Do You Know"; it contains 373 documents fitting in 8 categories. The second corpus is a set of prophetic traditions collected from the Prophetic encyclopedia "The Nine Book"; it contains 435 documents fitting in 14 categories. The study demonstrated the effectiveness of the proposed classifier, which gives a classification accuracy of 0.93 for the scientific corpus and 0.91 for the literary corpus.

Noaman et al. [13] introduced the use of a rooting algorithm with a Naïve Bayes classifier to resolve the problem of Arabic document categorization. To validate the proposed classification algorithm, the authors created a corpus of 300 documents belonging to 10 categories, which were selected based on the most popular categories from many newspaper articles collected from different online newspaper websites. The experimental study shows the success of the proposed classifier in terms of error rate, accuracy, and micro-average recall measures, and it achieves 62.23% classification accuracy.

Alsaleem [5] discussed the problem of automatically classifying Arabic text documents and used the Naïve Bayesian method (NB) and the Support Vector Machine algorithm (SVM) on different Arabic data sets to handle the Arabic text classification problem. The experimental results on different SNP Arabic text categorization data sets confirm that the SVM algorithm outperforms NB with regard to the F1, recall and precision measures.

Molijy et al. [12] proposed and implemented an automatic Arabic document indexing method to automatically create an index for Arabic books. The proposed method depends mainly on text summarization and abstraction processes to gather the main topics and statements in the book. It starts with a pre-processing step which removes irrelevant text (e.g. punctuation marks, diacritics, non-letters, etc.). Then it computes the frequency of every term in the document and reorders the terms in descending order. After that, a ranking algorithm is used to remove all terms with the highest and lowest frequencies. Finally, the system matches each term with the page number where the term occurs in the document and automatically adds the index to the end of the document. Experimental results in terms of accuracy and performance show that the proposed method can effectively replace the time-consuming human effort of indexing a large number of documents or books.

Al-Diabat [1] investigated the problem of Arabic text categorization using different rule-based classification data mining algorithms. These algorithms, which were contrasted on the problem of Arabic text classification, are: One Rule, rule induction (RIPPER), decision trees (C4.5), and a hybrid approach (PART). Inclusive experiments were carried out on a known Arabic text collection called SPA with respect to different evaluation measures, such as error rate and number of rules, to determine the best performing algorithm for the Arabic text classification problem. The results show that the most applicable algorithm is the hybrid approach of PART, which achieved better performance than the rest of the algorithms.

Zaki et al. [17] proposed a hybrid approach based on n-grams and the OKAPI model for the indexing and classification of an Arabic corpus. The hybrid approach takes into account the concept of the semantic vicinity of terms and the use of radial basis modelling. The use of semantic terminological resources such as semantic graphs and semantic dictionaries significantly improves the indexing and classification process. The hybridization of NGRAM-OKAPI statistical measures and a kernel function is used to calculate the similarity between terms in order to identify the relevant concepts which best represent a document.

Goweder et al. [7] developed a Centroid-based technique for Arabic text classification. The proposed algorithm is evaluated using a corpus containing a set of 1400 Arabic text documents covering seven distinct categories. The experimental results show that the adapted Centroid-based algorithm is applicable to classifying Arabic documents.


The performance criteria of the implemented Arabic classifier reached approximately 90.7%, 87.1%, 88.9%, 94.8%, and 5.2% for micro-averaged recall, precision, F-measure, accuracy, and error rate, respectively.
Sharef et al. [14] introduced a new Frequency Ratio Accumulation Method (FRAM) approach for Arabic TC. The proposed approach has a simple mathematical model and combines the categorization task with the feature selection task. This combination reduces the computational operations of the Arabic TC system, unlike other methods which treat feature selection and classification as separate major processes of automated TC. The performance of the FRAM classifier is compared with three classifiers based on the Bayesian theorem, namely the simple NB, Multi-variate Bernoulli Naïve Bayes (MNB) and Multinomial Naïve Bayes (MBNB) models. Experimental results show that FRAM outperformed the simple NB, MNB and MBNB, which are major supervised machine learning methods. FRAM achieved a 95.1% macro-F1 value using a unigram word-level representation.
Yousef et al. [14] introduced a new technique to extract Arabic word roots using an N-gram method. The proposed algorithm consists of several steps: it starts by normalizing the word, then divides it into bi-grams, and calculates the similarity between the original word and candidate roots selected from a roots list. The researchers tested their algorithm on a corpus of 141 roots. The results show that the proposed algorithm is capable of extracting the most probable roots for nearly 80% of the strong roots.

3. PROPOSED ALGORITHM
The proposed algorithm consists of two stages: a categorization stage and a classification stage. Following is the description of each stage.

Categorization Stage: in this stage the tested document is categorized into one of 10 categories. The categorization is done by comparing the keywords of the tested document with the keywords of each category using the TF.IDF measurement, and then the most related category is chosen. Steps 1 to 6 of the proposed algorithm represent this stage.

Classification Stage: in this stage a further comparison is made, this time between the tested document and the documents in the sub-categories of the chosen main category. The comparison uses the Chi square measurement to find the index words of each sub-category of the main category. This stage is represented by steps 7 and 8 of the proposed algorithm. Following are the steps of the proposed algorithm:

Step1 (Categorization Stage): Delete stop words from the tested document. Table (1) shows some Arabic stop words.

Table 1. Arabic Stop Words


Step2: Normalize the rest of the tested document. This step consists of several processes, such as:
 Removing punctuation.
 Deleting numbers, spaces and single letters.
 Converting the letters ( ء ), ( ؤ ), ( ئ ), ( أ ), ( إ ) to ( ا ), and ( ة ) to ( ه ).
Step3: Apply a stemming process to the tested document's words to delete affix letters (prefixes and suffixes) and extract the root of each word in the document.
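A minimal Python sketch of Steps 1-3 is given below; the stop-word list is only a small illustrative sample (the paper's list is in Table 1), and the root-extraction step is left as a stub because the paper does not name a specific stemming algorithm.

```python
import re

# Small illustrative sample of Arabic stop words; the paper's full list is in Table 1.
STOP_WORDS = {"في", "من", "على", "الى", "عن", "هذا", "التي"}

def normalize(text):
    """Step 2: remove punctuation, numbers and single letters, and unify letter forms."""
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation
    text = re.sub(r"\d+", " ", text)           # remove numbers
    text = re.sub(r"[ءؤئأإ]", "ا", text)        # hamza forms to bare alif
    text = text.replace("ة", "ه")              # taa marbuta to haa
    return [w for w in text.split() if len(w) > 1]   # drop single letters and extra spaces

def extract_root(word):
    """Step 3: placeholder for root extraction; a real Arabic stemmer would go here."""
    return word

def preprocess(document):
    """Steps 1-3: delete stop words, normalize the remaining text, then extract roots."""
    kept = " ".join(w for w in document.split() if w not in STOP_WORDS)   # Step 1
    return [extract_root(w) for w in normalize(kept)]                     # Steps 2-3
```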
Step4: Find the index terms of both the testing documents and the tested document by calculating the weight of each word using the TFIDF measurement, i.e. the Term Frequency (tfij, the frequency of term j in document i) times the Inverse Document Frequency (log(N/dfj), where N is the total number of documents and dfj is the number of documents containing term j), as shown in the following equation:

Wij = tfij * log (N / dfj)
Step5: Choose the top three words that have the largest
weight in the tested document (index words).
Step6: Compare the index words of the tested document with
the index words of each testing category to find the most
suitable main category.
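Steps 4 and 5 can be sketched as follows; the corpus argument and the helper names are assumptions made for illustration, with each document represented as the list of root forms produced by the previous steps.

```python
import math
from collections import Counter

def tfidf_weights(doc_terms, corpus):
    """Step 4: weight each word as Wij = tfij * log(N / dfj)."""
    n_docs = len(corpus)
    weights = {}
    for term, tf in Counter(doc_terms).items():
        df = sum(1 for d in corpus if term in d)   # number of documents containing the term
        if df:
            weights[term] = tf * math.log(n_docs / df)
    return weights

def index_words(doc_terms, corpus, k=3):
    """Step 5: keep the k words with the largest TF.IDF weight."""
    w = tfidf_weights(doc_terms, corpus)
    return sorted(w, key=w.get, reverse=True)[:k]

# Step 6 would then compare these index words with each category's index
# words and choose the category with the largest overlap.
```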
Step7 (Classification Stage): Calculate the weight of each word in the main category chosen in step 6 using the Chi square measurement to select the index words of each sub-category. This step is done by applying the following equation:

χ²(w, si) = N × (AD − CB)² / [(A + C) × (B + D) × (A + B) × (C + D)]

Where:
w: the word to be weighted.
si: the ith sub-category.
N: the total number of documents in the main category.
A: the number of documents in sub-category si containing word w.
B: the number of documents in sub-category si that do not contain word w.
C: the number of documents not in sub-category si but containing word w.
D: the number of documents neither in sub-category si nor containing word w.
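For illustration, the χ² weight can be computed directly from the four counts defined above; the counts in the example call are hypothetical.

```python
def chi_square(n, a, b, c, d):
    """Chi-square weight of a word w with respect to sub-category si.

    n: total documents in the main category
    a: documents in si containing w        b: documents in si not containing w
    c: documents outside si containing w   d: documents outside si not containing w
    """
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denominator == 0 else n * (a * d - c * b) ** 2 / denominator

# Hypothetical counts for one word and one sub-category:
print(chi_square(n=100, a=20, b=10, c=5, d=65))
```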
Step8: Calculate the similarity between the index words of the tested document and the index words of each sub-category of the chosen main category.

Step9: Return the names of the main category and sub-category with the highest matching percentage.
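The paper does not pin Steps 8 and 9 to a particular similarity function; a minimal sketch that scores each sub-category by the share of the tested document's index words it matches could look like the following (the sub-category names and words are made up).

```python
def best_sub_category(doc_index_words, sub_category_index_words):
    """Steps 8-9: score each sub-category by matched index words and return the best one."""
    doc_words = set(doc_index_words)
    scores = {
        name: len(doc_words & set(words)) / max(len(doc_words), 1)
        for name, words in sub_category_index_words.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Illustrative call with made-up sub-categories and index words:
print(best_sub_category(["اقتصاد", "سوق", "نفط"],
                        {"أسواق": ["سوق", "اسهم", "بورصه"],
                         "طاقة": ["نفط", "غاز", "اقتصاد"]}))
```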
Figure 1: Arabic Text Classification Algorithm

4. EXPERIMENTAL RESULTS
To examine the proposed algorithm, the researcher chose a training set of Arabic articles covering different topics. These documents are categorized into ten main categories and 50 sub-categories containing 1090 documents of varying size and content. These documents are used as the learning document set on which the proposed algorithm is applied. Table 2 shows the main and sub-categories of the training set and the number of documents in each category:


Table 2: Main and Sub-Categories of the Training Documents Set

The proposed algorithm was applied to a test set of 1100 documents. The results show that the proposed algorithm is capable of categorizing the tested documents into a main category and then classifying them into a suitable sub-category. Table (3) shows the results of categorizing selected tested documents into main categories, and Table (4) shows the results of classifying tested documents into sub-categories of the main category.

Table 3: Sample of Proposed Algorithm Percentage Results for the Categorization Stage

Table 4: Sample of Proposed Algorithm Percentage Results for the Classification Stage

5. CONCLUSION
This research introduced a dual-stage Arabic text classification algorithm that uses the TFIDF measurement for the categorization stage and the Chi square measurement for the classification stage. The researcher examined the proposed algorithm using 1090 training documents categorized into ten main categories and 50 sub-categories. The tested document set consisted of 1100 different documents. The experimental results show that the proposed algorithm is capable of classifying the tested documents into their appropriate sub-categories.

6. REFERENCES
[1] A. Alatabbi and C. S. Iliopoulos, "Morphological analysis and generation for Arabic language," pp. 1-9.
[2] A. Farghaly and K. Shaalan, "Arabic Natural Language Processing: Challenges and Solutions," ACM Transactions on Asian Language Information Processing, vol. 8, no. 4, pp. 1-22, 2009.
[3] R. Guzmán-Cabrera, M. Montes-y-Gómez, P. Rosso et al., "Using the Web as corpus for self-training text categorization," Information Retrieval, vol. 12, no. 3, pp. 400-415, 2009.
[4] A. H. Wahbeh and M. Al-Kabi, "Comparative Assessment of the Performance of Three WEKA Text Classifiers Applied to Arabic Text," Abhath Al-Yarmouk: Basic Sci. & Eng., vol. 21, no. 1, pp. 15-28, 2012.
[5] R. L. Liu, "Context recognition for hierarchical text classification," Journal of the American Society for Information Science and Technology, vol. 60, no. 4, pp. 803-813, 2009.
[6] R. Al-Shalabi, G. Kanaan, and M. Gharaibeh, "Arabic text categorization using kNN algorithm," pp. 1-9.
[7] B. Sharef, N. Omar, and Z. Sharef, "An Automated Arabic Text Categorization Based on the Frequency Ratio Accumulation," International Arab Journal of Information Technology (IAJIT), vol. 11, no. 2, pp. 213-221, 2014.


[8] A. Goweder, M. Elboashi, and A. Elbekai, "Centroid-Based Arabic Classifier," pp. 1-8.
[9] A. A. Molijy, I. Hmeidi, and I. Alsmadi, "Indexing of Arabic documents automatically based on lexical analysis," International Journal on Natural Language Computing, vol. 1, no. 1, pp. 1-8, 2012.
[10] M. Al-diabat, "Arabic Text Categorization Using Classification Rule Mining," Applied Mathematical Sciences, vol. 6, no. 81, pp. 4033-4046, 2012.
[11] S. Alsaleem, "Automated Arabic Text Categorization Using SVM and NB," Int. Arab J. e-Technol., vol. 2, no. 2, pp. 124-128, 2011.
[12] T. Zaki, D. Mammass, A. Ennaji et al., "Arabic Documents Classification by a Radial Basis Hybridization," pp. 37-44.
[13] M. M. Syiam, Z. T. Fayed, and M. Habib, "An intelligent system for Arabic text categorization," International Journal of Intelligent Computing and Information Sciences, vol. 6, no. 1, pp. 1-19, 2006.
[14] S. Al-Harbi, A. Almuhareb, A. Al-Thubaity et al., "Automatic Arabic text classification," pp. 77-83.
[15] A. M. d. A. Mesleh, "Chi Square Feature Extraction Based Svms Arabic Language Text Categorization System," Journal of Computer Science, vol. 3, no. 6, 2007.
[16] F. Harrag, E. El-Qawasmeh, and P. Pichappan, "Improving arabic text categorization using decision trees," pp. 110-115.
[17] H. M. Noaman, S. Elmougy, A. Ghoneim et al., "Naive Bayes Classifier based Arabic document categorization," pp. 1-5.


