Genetic Algorithm and Confusion Matrix For Document Clustering
Research Scholar, Bharathiar University, Coimbatore – 638401, India.
Copyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.
IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 1, No 2, January 2012
ISSN (Online): 1694-0814
www.IJCSI.org 323
S. Wu et al. [3] note that data clustering is a common technique for statistical data analysis that has been used in a variety of engineering and scientific disciplines, such as biology (genome data). Y. Zhao and G. Karypis [5] define the purity of a cluster as the fraction of the cluster corresponding to the largest class of documents assigned to that cluster.

Clustering based on hybrid GAs can be more efficient, but these techniques can still suffer from premature convergence. Furthermore, all of the above methods may exhibit limited performance, since they perform clustering on all features without selection. GAs have also been proposed for feature selection [7]. However, they are usually developed in the supervised learning context, where class labels of the data are available, and the main purpose is to reduce the number of features used in classification while maintaining acceptable classification accuracies. The second (and related) theme is feature selection for clustering; feature selection research has a long history, as reported in the literature.
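As a minimal sketch (not the paper's implementation), the purity measure described above can be computed as follows, assuming each document carries a known class label:

```python
from collections import Counter

def cluster_purity(labels_in_cluster):
    """Purity of one cluster: fraction belonging to its majority class."""
    counts = Counter(labels_in_cluster)
    return max(counts.values()) / len(labels_in_cluster)

def overall_purity(clusters):
    """Average of per-cluster purities, weighted by cluster size."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * cluster_purity(c) for c in clusters) / total

clusters = [["sports", "sports", "politics"],    # purity 2/3
            ["tech", "tech", "tech", "sports"]]  # purity 3/4
print(round(overall_purity(clusters), 3))        # -> 0.714  (i.e. 5/7)
```

A purity of 1.0 means every cluster contains documents of a single class; the weighted average keeps large clusters from being dominated by small, trivially pure ones.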
One way of approaching this challenge is to use stochastic optimization schemes, prominent among which is an approach based on genetic algorithms (GAs). The GA is biologically inspired and embodies many mechanisms that mimic natural evolution. It has a great deal of potential in scientific and engineering optimization and search problems. Recently, hybrid methods [8], which incorporate local searches into traditional GAs, have been proposed and applied successfully to solve a wide variety of optimization problems. These studies show that pure GAs [16] are not well suited to fine-tuning structures in complex search spaces and that hybridization with other techniques can greatly improve their efficiency. GAs that have been hybridized with local searches are also known as memetic algorithms (MAs) [7].

Feature selection in the context of supervised learning adopts methods that are usually divided into two classes, filters and wrappers, based on whether or not feature selection is implemented independently of the learning algorithm. To maintain the filter/wrapper distinction used in supervised feature selection, we also classify feature selection methods for clustering into these two categories, based on whether or not the process is carried out independently of the clustering algorithm [13, 14, 15]. The filters in clustering basically preselect the features and then apply a clustering algorithm to the selected feature subset. The principle is that any feature carrying little or no additional information beyond that subsumed by the remaining features is redundant and should be eliminated.
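As an illustrative sketch of the filter idea (not the selection criterion used in this paper), one simple filter preselects terms by document frequency before any clustering runs; the thresholds here are hypothetical:

```python
def filter_features(doc_term_matrix, min_df=2, max_df_ratio=0.5):
    """Return indices of terms that are neither too rare nor too common.

    doc_term_matrix: rows are documents, columns are term counts.
    Terms below min_df carry little information; terms in more than
    max_df_ratio of documents behave like stopwords.
    """
    n_docs = len(doc_term_matrix)
    n_terms = len(doc_term_matrix[0])
    kept = []
    for t in range(n_terms):
        df = sum(1 for doc in doc_term_matrix if doc[t] > 0)
        if min_df <= df <= max_df_ratio * n_docs:
            kept.append(t)
    return kept

matrix = [[3, 0, 1, 1],
          [2, 0, 1, 1],
          [0, 1, 0, 1],
          [0, 4, 0, 1]]
print(filter_features(matrix))  # term 3 occurs in every document and is dropped
```

Because the filter never consults the clustering algorithm, it can be run once, cheaply, before any number of clustering experiments.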
Traditional GAs and MAs are generally suitable for locating the optimal solution of an optimization problem with a small number of local optima. Complex problems such as clustering, however, often involve a significant number of locally optimal solutions. In such cases, traditional GAs and MAs cannot maintain controlled competition among the individual solutions and can cause the population to converge prematurely [3]. To improve the situation, various methods (usually called niche methods) have been proposed [7]. The research reported shows that one of the key elements in finding the optimal solution to a difficult problem with a GA approach is to preserve population diversity during the search, since this permits the GA to investigate many peaks in parallel and helps prevent it from being trapped in local optima. GAs are naturally applicable to problems with exponential search spaces and have consequently been a significant source of interest for clustering [6, 10]. For example, [4] proposed the use of traditional GAs for partitional clustering. These methods can be very expensive and are susceptible to becoming trapped in locally optimal solutions when clustering large data sets. Hybrid GAs were introduced in [8] by incorporating clustering-specific local searches into traditional GAs, in contrast to the methods proposed in [11] and [12].

3. Document Clustering

While document clustering can be valuable for categorizing documents into meaningful groups, the usefulness of categorization cannot be fully appreciated without labeling those clusters with the relevant keywords or key phrases that describe the various topics associated with them. A highly accurate key phrase extraction algorithm, called CorePhrase, is proposed for this particular purpose.

CorePhrase works by building a complete list of phrases shared by at least two documents in a cluster. Phrases are assigned scores according to a set of features calculated from the matching process. The candidate phrases are then ranked in descending order, and the top L phrases are output as a label for the cluster. While this algorithm on its own is useful for labeling document clusters, it is also used to produce cluster summaries for the collaborative clustering algorithm.

Document clustering is used to organize a large document collection into distinct groups of similar documents. It discerns general themes hidden within the corpus. Applications of document clustering go beyond organizing document collections into knowledge maps, which can facilitate subsequent knowledge retrieval and access. Document clustering, for example, has been applied to improve the efficiency of text categorization.
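The CorePhrase-style labeling described above can be sketched roughly as follows. This is an illustrative simplification: it matches word bigrams rather than arbitrary phrases, and the score (number of document pairs sharing the phrase) is a hypothetical stand-in for CorePhrase's full feature-based scoring:

```python
from collections import Counter
from itertools import combinations

def shared_phrases(doc, other, n=2):
    """Word n-grams appearing in both documents."""
    grams = lambda words: {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return grams(doc.split()) & grams(other.split())

def label_cluster(docs, top_l=2, n=2):
    """Collect phrases shared by at least two documents, rank them, return top L."""
    scores = Counter()
    for a, b in combinations(docs, 2):
        for phrase in shared_phrases(a, b, n):
            scores[phrase] += 1  # hypothetical score: count of matching pairs
    return [" ".join(p) for p, _ in scores.most_common(top_l)]

docs = ["genetic algorithm for clustering",
        "improved genetic algorithm for documents",
        "document clustering with genetic algorithm"]
print(label_cluster(docs))  # -> ['genetic algorithm', 'algorithm for']
```

The top-ranked phrase ("genetic algorithm", shared by all three document pairs) would serve as the cluster label.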
It has also been used to discover event episodes in temporally ordered documents. In addition, instead of presenting search results as one long list, some prior studies and emerging search engines employ a document clustering approach to automatically organize search results into meaningful categories and thereby support cluster-based browsing.

A confusion matrix contains information about the actual and predicted classifications produced by a classification system. The performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier:

                    Predicted Negative    Predicted Positive
 Actual Negative            a                     b
 Actual Positive            c                     d
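Under the usual convention for this layout (a = true negatives, b = false positives, c = false negatives, d = true positives), the evaluation measures used later in the paper can all be derived from the four cells. A brief sketch:

```python
def confusion_matrix(actual, predicted):
    """Counts (a, b, c, d) for a two-class problem with labels 0/1."""
    a = sum(1 for y, p in zip(actual, predicted) if y == 0 and p == 0)
    b = sum(1 for y, p in zip(actual, predicted) if y == 0 and p == 1)
    c = sum(1 for y, p in zip(actual, predicted) if y == 1 and p == 0)
    d = sum(1 for y, p in zip(actual, predicted) if y == 1 and p == 1)
    return a, b, c, d

def metrics(a, b, c, d):
    precision = d / (d + b)
    recall = d / (d + c)                # true positive rate
    fpr = b / (b + a)                   # false positive rate
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, fpr, f_measure

actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
a, b, c, d = confusion_matrix(actual, predicted)
print((a, b, c, d))                     # -> (3, 1, 1, 3)
print(metrics(a, b, c, d))              # precision, recall, FPR, F-measure
```

The F-measure computed this way is the quantity the proposed improved GA uses as its fitness function.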
We compute classification errors for the clustering solutions, since we know the "true" clusters of the synthetic data and the class labels of the real data. This is done by first running the algorithm to be tested on each data set. Next, each cluster of the clustering results is assigned to a class by examining the class labels of the data objects in that cluster and choosing the majority class. After that, the classification errors are computed by counting the number of misclassified data objects. For the identification of correct clusters, we initially report the number of clusters found. We stress that the class labels are not used during the generation of the clustering results; they are intended only to provide independent verification of the clusters.

Feature recall and precision are reported on the synthetic data, since the relevant features are known a priori. Recall and precision are concepts from text retrieval. Feature recall is the number of relevant features in the selected subset divided by the total number of relevant features. Feature precision is the number of relevant features in the selected subset divided by the total number of features selected. These indices give us an indication of the quality of the features selected.

Fig 2: Clustered DataSet

Preliminarily, documents were subjected to preprocessing, including the removal of words occurring in a list of common stopwords (stopword removal). At this point, we randomly split the set of seen data into a training set (70%), on which to run the GA, and a validation set (30%), on which to tune the model parameters. We performed the split in such a way that each category was proportionally represented in both sets (stratified holdout). Based on the term frequency and inverse document frequency, the term weight is then calculated.

Fig 3: Confusion matrix on text documents

The performance results are measured in terms of F-measure, purity, and false positive rate, according to the number of documents and the cluster object.

[Figs. 4-6 (plots): F-measure, purity, and false positive rate vs. cluster object, comparing Improved GA with the Niching memetic algorithm.]
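The evaluation protocol described above (assign each cluster to its majority class, then count misclassified objects) can be sketched as follows; the class labels used here are hypothetical:

```python
from collections import Counter

def classification_errors(clusters):
    """clusters: one list of true class labels per discovered cluster.

    Each cluster is assigned to its majority class; every object whose
    label differs from that majority class counts as one error.
    """
    errors = 0
    for labels in clusters:
        majority, count = Counter(labels).most_common(1)[0]
        errors += len(labels) - count  # objects outside the majority class
    return errors

clusters = [["A", "A", "B"],        # majority A -> 1 error
            ["B", "B", "B", "A"]]   # majority B -> 1 error
print(classification_errors(clusters))  # -> 2
```

Note that the labels enter only at evaluation time, after clustering has finished, so the verification remains independent of the clustering itself.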
Figures 4, 5, and 6 show the results for F-measure, purity, and false positive rate with respect to the cluster object. The proposed improved GA gives better results than the existing method, the Niching memetic algorithm.

[Fig. 7 plot: F-measure vs. no. of documents, comparing Improved GA with the Niching memetic algorithm.]

Fig 7: No. of Documents vs F-measure

[Fig. 8 plot: purity (%) vs. no. of documents.]

Fig 9: No. of documents vs false positive rate

Figures 7, 8, and 9 depict the performance results for F-measure, purity, and false positive rate according to the number of documents. It is observed that the improved GA performs well: compared with the Niching memetic algorithm, the proposed improved GA can efficiently recover solutions with low classification errors.

6. CONCLUSION

The improved Niche memetic algorithm and the improved genetic algorithm have been designed and implemented using confusion matrices. Our proposed method is applied to real data sets with an abundance of irrelevant or redundant features. The improved GA relies on confusion matrices and uses the F-measure as the fitness function. In this case, identifying a relevant subset that adequately captures the underlying structure in the data can be particularly useful. Additionally, as a general optimization framework, the proposed algorithm can be applied to text mining. In such a case, a clustering criterion that is unbiased in some sense is produced by computing the mutual information between clusters, thus enabling a better verification of the properties of the proposed optimization scheme. We conclude by remarking that we consider the experimental results can be further improved through fine-tuning of the GA parameters.

References

[1] A.K. Santra, C. Josephine Christy, and B. Nagarajan, "Cluster Based Hybrid Niche Memetic and Genetic Algorithm for Text Document Categorization," IJCSI, vol. 8, issue 5, no. 2, pp. 450-456, Sep. 2011.

[2] S. Areibi and Z. Yang, "Effective Memetic Algorithms

[6] J. Kogan, C. Nicholas, and V. Volkovich, "Text Mining with Information-Theoretic Clustering," IEEE Computational Science and Eng., pp. 52-59, 2003.

[7] W. Sheng, A. Tucker, and X. Liu, "Clustering with Niching Genetic K-Means Algorithm," Proc. Genetic and Evolutionary Computation Conf. (GECCO '04), pp. 162-173, 2004.

[8] K. Deep and K.N. Das, "Quadratic Approximation Based Hybrid Genetic Algorithm for Function Optimization," AMC, Elsevier, vol. 203, pp. 86-98, 2008.

[9] C. Wei, C.S. Yang, H.W. Hsiao, and T.H. Cheng, "Combining Preference- and Content-Based Approaches for Improving Document Clustering Effectiveness," Information Processing & Management, vol. 42, no. 2, pp. 350-372, 2006.
[10] R. Guan, X. Shi, M. Marchese, C. Yang, and Y. Liang, "Text Clustering with Seeds Affinity Propagation," IEEE Trans. Knowledge and Data Eng., vol. 23, no. 4, Apr. 2011.

[11] Y.J. Li, C. Luo, and S.M. Chung, "Text Clustering with Feature Selection by Using Statistical Data," IEEE Trans. Knowledge and Data Eng., vol. 20, no. 5, pp. 641-652, May 2008.

[12] B.J. Frey and D. Dueck, "Non-Metric Affinity Propagation for Unsupervised Image Categorization," Proc. 11th IEEE Int'l Conf. Computer Vision (ICCV '07), pp. 1-8, Oct. 2007.

[13] L.P. Jing, M.K. Ng, and J.Z. Huang, "An Entropy Weighting K-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 8, pp. 1026-1041, Aug. 2007.

[14] Z.H. Zhou and M. Li, "Distributional Features for Text Categorization," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 3, pp. 428-442, Mar. 2009.

[15] F. Pan, X. Zhang, and W. Wang, "CRD: Fast Co-Clustering on Large Data Sets Utilizing Sampling-Based Matrix Decomposition," Proc. ACM SIGMOD, 2008.

[16] J.-Y. Jiang, R.-J. Liou, and S.-J. Lee, "A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification," IEEE Trans. Knowledge and Data Eng., vol. 23, no. 3, Mar. 2011.

C. Josephine Christy received her M.Sc., M.Phil., and M.B.A. from Bharathiar University, Coimbatore. Currently she is working as an Assistant Professor at Bannari Amman Institute of Technology, Sathyamangalam. Her areas of interest include Text Mining and Web Mining. She has presented a paper in an international journal, 2 papers in international conferences, and 6 papers in national conferences. She is a life member of the Computer Society of India and a life member of the Indian Society for Technical Education.