Genetic Algorithm and Confusion Matrix For Document Clustering
Research Scholar, Bharathiar University, Coimbatore – 638401, India.
Copyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.
IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 1, No 2, January 2012
ISSN (Online): 1694-0814
www.IJCSI.org 323
S. Wu et al. [3] note that data clustering is a common technique for statistical data analysis that has been used in a variety of engineering and scientific disciplines, such as biology (genome data). Y. Zhao and G. Karypis [5] define the purity of a cluster as the fraction of the cluster corresponding to the largest class of documents assigned to that cluster.

Clustering based on hybrid GAs can be more efficient, but these techniques can still suffer from premature convergence. Furthermore, all of the above methods may exhibit limited performance, since they perform clustering on all features without selection. GAs have also been proposed for feature selection [7]. However, they are usually developed in the supervised learning context, where class labels of the data are available, and the main purpose is to reduce the number of features used in classification while maintaining acceptable classification accuracies. The second (and related) theme is feature selection for clustering; feature selection research has a long history, as reported in the literature.
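As a minimal sketch (not the paper's implementation), the purity measure described above can be computed as follows, assuming each document carries a known class label:

```python
from collections import Counter

def cluster_purity(labels_in_cluster):
    """Purity of one cluster: fraction belonging to its majority class."""
    counts = Counter(labels_in_cluster)
    return max(counts.values()) / len(labels_in_cluster)

def overall_purity(clusters):
    """Average of per-cluster purities, weighted by cluster size."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * cluster_purity(c) for c in clusters) / total

clusters = [["sports", "sports", "politics"],    # purity 2/3
            ["tech", "tech", "tech", "sports"]]  # purity 3/4
print(round(overall_purity(clusters), 3))        # -> 0.714  (i.e. 5/7)
```

A purity of 1.0 means every cluster contains documents of a single class; the weighted average keeps large clusters from being dominated by small, trivially pure ones.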
One way of approaching this challenge is to use stochastic optimization schemes, prominent among which is an approach based on genetic algorithms (GAs). The GA is biologically inspired and embodies many mechanisms that mimic natural evolution. It has a great deal of potential in scientific and engineering optimization and search problems. Recently, hybrid methods [8], which incorporate local searches into traditional GAs, have been proposed and applied successfully to solve a wide variety of optimization problems. These studies show that pure GAs [16] are not well suited to fine-tuning structures in complex search spaces and that hybridization with other techniques can greatly improve their efficiency. GAs that have been hybridized with local searches are also known as memetic algorithms (MAs) [7].

Feature selection in the context of supervised learning adopts methods that are usually divided into two classes, filters and wrappers, based on whether or not feature selection is implemented independently of the learning algorithm. To maintain the filter/wrapper distinction used in supervised feature selection, we also classify feature selection methods for clustering into these two categories, based on whether or not the process is carried out independently of the clustering algorithm [13, 14, 15]. The filters in clustering basically preselect the features and then apply a clustering algorithm to the selected feature subset. The principle is that any feature carrying little or no additional information beyond that subsumed by the remaining features is redundant and should be eliminated.
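As an illustrative sketch of the filter idea (not the selection criterion used in this paper), one simple filter preselects terms by document frequency before any clustering runs; the thresholds here are hypothetical:

```python
def filter_features(doc_term_matrix, min_df=2, max_df_ratio=0.5):
    """Return indices of terms that are neither too rare nor too common.

    doc_term_matrix: rows are documents, columns are term counts.
    Terms below min_df carry little information; terms in more than
    max_df_ratio of documents behave like stopwords.
    """
    n_docs = len(doc_term_matrix)
    n_terms = len(doc_term_matrix[0])
    kept = []
    for t in range(n_terms):
        df = sum(1 for doc in doc_term_matrix if doc[t] > 0)
        if min_df <= df <= max_df_ratio * n_docs:
            kept.append(t)
    return kept

matrix = [[3, 0, 1, 1],
          [2, 0, 1, 1],
          [0, 1, 0, 1],
          [0, 4, 0, 1]]
print(filter_features(matrix))  # term 3 occurs in every document and is dropped
```

Because the filter never consults the clustering algorithm, it can be run once, cheaply, before any number of clustering experiments.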
Traditional GAs and MAs are generally suitable for locating the optimal solution of an optimization problem with a small number of local optima. Complex problems such as clustering, however, often involve a significant number of locally optimal solutions. In such cases, traditional GAs and MAs cannot maintain controlled competition among the individual solutions and can cause the population to converge prematurely [3]. To improve the situation, various methods (usually called niche methods) have been proposed [7]. The research reported shows that one of the key elements in finding the optimal solution to a difficult problem with a GA approach is to preserve population diversity during the search, since this permits the GA to investigate many peaks in parallel and helps prevent it from being trapped in local optima. GAs are naturally applicable to problems with exponential search spaces and have consequently been a significant source of interest for clustering [6, 10]. For example, [4] proposed the use of traditional GAs for partitional clustering. These methods can be very expensive and are susceptible to becoming trapped in locally optimal solutions when clustering large data sets. Hybrid GAs were introduced in [8] by incorporating clustering-specific local searches into traditional GAs, in contrast to the methods proposed in [11] and [12].

3. Document Clustering

While document clustering can be valuable for categorizing documents into meaningful groups, the usefulness of categorization cannot be fully appreciated without labeling those clusters with the relevant keywords or key phrases that describe the various topics associated with them. A highly accurate key phrase extraction algorithm, called CorePhrase, is proposed for this particular purpose.

CorePhrase works by building a complete list of phrases shared by at least two documents in a cluster. Phrases are assigned scores according to a set of features calculated from the matching process. The candidate phrases are then ranked in descending order, and the top L phrases are output as a label for the cluster. While this algorithm on its own is useful for labeling document clusters, it is also used to produce cluster summaries for the collaborative clustering algorithm.

Document clustering is used to organize a large document collection into distinct groups of similar documents. It discerns general themes hidden within the corpus. Applications of document clustering go beyond organizing document collections into knowledge maps, which can facilitate subsequent knowledge retrieval and access. Document clustering, for example, has been applied to improve the efficiency of text categorization.
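The CorePhrase-style labeling described above can be sketched roughly as follows. This is an illustrative simplification: it matches word bigrams rather than arbitrary phrases, and the score (number of document pairs sharing the phrase) is a hypothetical stand-in for CorePhrase's full feature-based scoring:

```python
from collections import Counter
from itertools import combinations

def shared_phrases(doc, other, n=2):
    """Word n-grams appearing in both documents."""
    grams = lambda words: {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return grams(doc.split()) & grams(other.split())

def label_cluster(docs, top_l=2, n=2):
    """Collect phrases shared by at least two documents, rank them, return top L."""
    scores = Counter()
    for a, b in combinations(docs, 2):
        for phrase in shared_phrases(a, b, n):
            scores[phrase] += 1  # hypothetical score: count of matching pairs
    return [" ".join(p) for p, _ in scores.most_common(top_l)]

docs = ["genetic algorithm for clustering",
        "improved genetic algorithm for documents",
        "document clustering with genetic algorithm"]
print(label_cluster(docs))  # -> ['genetic algorithm', 'algorithm for']
```

The top-ranked phrase ("genetic algorithm", shared by all three document pairs) would serve as the cluster label.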
It has also been used to discover event episodes in temporally ordered documents. In addition, instead of presenting search results as one long list, some prior studies and emerging search engines employ a document clustering approach to automatically organize search results into meaningful categories and thereby support cluster-based browsing.

A confusion matrix contains information about the actual and predicted classifications produced by a classification system. The performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier:

                    Predicted Negative    Predicted Positive
 Actual Negative            a                     b
 Actual Positive            c                     d
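Under the usual convention for this layout (a = true negatives, b = false positives, c = false negatives, d = true positives), the evaluation measures used later in the paper can all be derived from the four cells. A brief sketch:

```python
def confusion_matrix(actual, predicted):
    """Counts (a, b, c, d) for a two-class problem with labels 0/1."""
    a = sum(1 for y, p in zip(actual, predicted) if y == 0 and p == 0)
    b = sum(1 for y, p in zip(actual, predicted) if y == 0 and p == 1)
    c = sum(1 for y, p in zip(actual, predicted) if y == 1 and p == 0)
    d = sum(1 for y, p in zip(actual, predicted) if y == 1 and p == 1)
    return a, b, c, d

def metrics(a, b, c, d):
    precision = d / (d + b)
    recall = d / (d + c)                # true positive rate
    fpr = b / (b + a)                   # false positive rate
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, fpr, f_measure

actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
a, b, c, d = confusion_matrix(actual, predicted)
print((a, b, c, d))                     # -> (3, 1, 1, 3)
print(metrics(a, b, c, d))              # precision, recall, FPR, F-measure
```

The F-measure computed this way is the quantity the proposed improved GA uses as its fitness function.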
We compute classification errors for the clustering solutions, since we know the "true" clusters of the synthetic data and the class labels of the real data. This is done by first running the algorithm to be tested on each data set. Next, each cluster of the clustering results is assigned to a class by examining the class labels of the data objects in that cluster and choosing the majority class. After that, the classification errors are computed by counting the number of misclassified data objects. For the identification of correct clusters, we initially report the number of clusters found. We stress that the class labels are not used during the generation of the clustering results; they are intended only to provide independent verification of the clusters.

Feature recall and precision are reported on the synthetic data, since the relevant features are known a priori. Recall and precision are concepts from text retrieval. Feature recall is the number of relevant features in the selected subset divided by the total number of relevant features. Feature precision is the number of relevant features in the selected subset divided by the total number of features selected. These indices give us an indication of the quality of the features selected.

Fig 2: Clustered DataSet

Preliminarily, documents were subjected to preprocessing, including the removal of words occurring in a list of common stopwords (stopword removal). At this point, we randomly split the set of seen data into a training set (70%), on which to run the GA, and a validation set (30%), on which to tune the model parameters. We performed the split in such a way that each category was proportionally represented in both sets (stratified holdout). Based on the term frequency and inverse document frequency, the term weight is then calculated.

Fig 3: Confusion matrix on text documents

The performance results are measured in terms of F-measure, purity, and false positive rate, according to the number of documents and the cluster object.

[Figs. 4-6 (plots): F-measure, purity, and false positive rate vs. cluster object, comparing Improved GA with the Niching memetic algorithm.]
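The evaluation protocol described above (assign each cluster to its majority class, then count misclassified objects) can be sketched as follows; the class labels used here are hypothetical:

```python
from collections import Counter

def classification_errors(clusters):
    """clusters: one list of true class labels per discovered cluster.

    Each cluster is assigned to its majority class; every object whose
    label differs from that majority class counts as one error.
    """
    errors = 0
    for labels in clusters:
        majority, count = Counter(labels).most_common(1)[0]
        errors += len(labels) - count  # objects outside the majority class
    return errors

clusters = [["A", "A", "B"],        # majority A -> 1 error
            ["B", "B", "B", "A"]]   # majority B -> 1 error
print(classification_errors(clusters))  # -> 2
```

Note that the labels enter only at evaluation time, after clustering has finished, so the verification remains independent of the clustering itself.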
Figures 4, 5, and 6 show the results for F-measure, purity, and false positive rate with respect to the cluster object. The proposed improved GA gives better results than the existing method, the Niching memetic algorithm.

[Fig. 7 plot: F-measure vs. no. of documents, comparing Improved GA with the Niching memetic algorithm.]

Fig 7: No. of Documents vs F-measure

[Fig. 8 plot: purity (%) vs. no. of documents.]

Fig 9: No. of documents vs false positive rate

Figures 7, 8, and 9 depict the performance results for F-measure, purity, and false positive rate according to the number of documents. It is observed that the improved GA performs well: compared with the Niching memetic algorithm, the proposed improved GA can efficiently recover solutions with low classification errors.

6. CONCLUSION

The improved Niche memetic algorithm and the improved genetic algorithm have been designed and implemented using confusion matrices. Our proposed method is applied to real data sets with an abundance of irrelevant or redundant features. The improved GA relies on confusion matrices and uses the F-measure as the fitness function. In this case, identifying a relevant subset that adequately captures the underlying structure in the data can be particularly useful. Additionally, as a general optimization framework, the proposed algorithm can be applied to text mining. In such a case, a clustering criterion that is unbiased in some sense is produced by computing the mutual information between clusters, thus enabling a better verification of the properties of the proposed optimization scheme. We conclude by remarking that we consider the experimental results can be further improved through fine-tuning of the GA parameters.

References

[1] A.K. Santra, C. Josephine Christy, and B. Nagarajan, "Cluster Based Hybrid Niche Memetic and Genetic Algorithm for Text Document Categorization," IJCSI, vol. 8, issue 5, no. 2, pp. 450-456, Sep. 2011.

[2] S. Areibi and Z. Yang, "Effective Memetic Algorithms

[6] J. Kogan, C. Nicholas, and V. Volkovich, "Text Mining with Information-Theoretic Clustering," IEEE Computational Science and Eng., pp. 52-59, 2003.

[7] W. Sheng, A. Tucker, and X. Liu, "Clustering with Niching Genetic K-Means Algorithm," Proc. Genetic and Evolutionary Computation Conf. (GECCO '04), pp. 162-173, 2004.

[8] K. Deep and K.N. Das, "Quadratic Approximation Based Hybrid Genetic Algorithm for Function Optimization," AMC, Elsevier, vol. 203, pp. 86-98, 2008.

[9] C. Wei, C.S. Yang, H.W. Hsiao, and T.H. Cheng, "Combining Preference- and Content-Based Approaches for Improving Document Clustering Effectiveness," Information Processing & Management, vol. 42, no. 2, pp. 350-372, 2006.
[10] R. Guan, X. Shi, M. Marchese, C. Yang, and Y. Liang, "Text Clustering with Seeds Affinity Propagation," IEEE Trans. Knowledge and Data Eng., vol. 23, no. 4, Apr. 2011.

[11] Y.J. Li, C. Luo, and S.M. Chung, "Text Clustering with Feature Selection by Using Statistical Data," IEEE Trans. Knowledge and Data Eng., vol. 20, no. 5, pp. 641-652, May 2008.

[12] B.J. Frey and D. Dueck, "Non-Metric Affinity Propagation for Unsupervised Image Categorization," Proc. 11th IEEE Int'l Conf. Computer Vision (ICCV '07), pp. 1-8, Oct. 2007.

[13] L.P. Jing, M.K. Ng, and J.Z. Huang, "An Entropy Weighting K-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 8, pp. 1026-1041, Aug. 2007.

[14] Z.H. Zhou and M. Li, "Distributional Features for Text Categorization," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 3, pp. 428-442, Mar. 2009.

[15] F. Pan, X. Zhang, and W. Wang, "CRD: Fast Co-Clustering on Large Data Sets Utilizing Sampling-Based Matrix Decomposition," Proc. ACM SIGMOD, 2008.

[16] J.-Y. Jiang, R.-J. Liou, and S.-J. Lee, "A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification," IEEE Trans. Knowledge and Data Eng., vol. 23, no. 3, Mar. 2011.

C. Josephine Christy received her M.Sc., M.Phil., and M.B.A. from Bharathiar University, Coimbatore. Currently she is working as an Assistant Professor at Bannari Amman Institute of Technology, Sathyamangalam. Her areas of interest include Text Mining and Web Mining. She has presented a paper in an international journal, 2 papers in international conferences, and 6 papers in national conferences. She is a life member of the Computer Society of India and a life member of the Indian Society for Technical Education.