Execution Sequence
==================

Compile and run the Java source files in the following order. Make sure the input and output file names are consistent from one step to the next.

1. PreProcessData.java
2. TopDown16.java
3. SortbyLength2.java
4. CleanOutput5.java
5. Clusterer5NoMap.java
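
The sequence above could be automated with a small driver along the following lines. This is only a sketch, not part of the repository: the class RunPipelineSketch and its methods are hypothetical, the class names follow the capitalization of the .java files in the root folder, and it assumes that each class has a main method taking no arguments and that javac and java are on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical driver: compiles and runs each step in the documented order.
class RunPipelineSketch {

    // Step order taken from the execution sequence in this README.
    static final List<String> STEPS = Arrays.asList(
            "PreProcessData", "TopDown16", "SortbyLength2",
            "CleanOutput5", "Clusterer5NoMap");

    // Builds the compile command and the run command for one step.
    static List<List<String>> commandsFor(String className) {
        return Arrays.asList(
                Arrays.asList("javac", className + ".java"),
                Arrays.asList("java", className));
    }

    // Runs one command and stops the pipeline if it fails.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IllegalStateException(command + " exited with code " + exitCode);
        }
    }

    public static void main(String[] args) throws Exception {
        // ASSUMPTION: run from the repository root, where the .java files live.
        for (String step : STEPS) {
            for (List<String> command : commandsFor(step)) {
                run(command);
            }
        }
    }
}
```

Stopping at the first non-zero exit code keeps a later step from reading a stale or missing output file of an earlier one.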


Data Preparation
================

The input data file is a matrix with data points as rows and features as columns. Each entry is the frequency of that feature in that data point. The data matrix for the CLASSIC3 data set is in "dataset_NG.txt" in the NG_Data folder.
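
To make the row and column layout concrete, here is a minimal sketch of loading such a matrix. The class MatrixReaderSketch is hypothetical, and the exact file format is an assumption: one data point per line, with integer frequencies separated by whitespace. Check dataset_NG.txt and PreProcessData.java for the real delimiter before relying on it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for a "data points x features" frequency matrix.
class MatrixReaderSketch {

    // ASSUMPTION: each non-blank line is one data point and holds
    // whitespace-separated integer feature frequencies.
    static int[][] parseRows(List<String> lines) {
        List<int[]> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] tokens = trimmed.split("\\s+");
            int[] row = new int[tokens.length];
            for (int j = 0; j < tokens.length; j++) {
                row[j] = Integer.parseInt(tokens[j]);
            }
            rows.add(row);
        }
        return rows.toArray(new int[0][]);
    }

    // Reads the whole file into memory and parses it.
    static int[][] readMatrix(Path file) throws IOException {
        return parseRows(Files.readAllLines(file));
    }

    public static void main(String[] args) {
        // Two data points with three features each (made-up values).
        int[][] matrix = parseRows(Arrays.asList("0 2 1", "3 0 0"));
        System.out.println(matrix.length + " data points, " + matrix[0].length + " features");
    }
}
```

In this layout, matrix[i][j] is the frequency of feature j in data point i.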


Reading the Output
==================

The output file is "result.txt" in the root folder. It contains entries such as "671 5260 7824 : 462 1458 2820 3685". The elements before the ":" are the feature ids of a group, and the elements after the ":" are the ids of the documents that contain these features.
Accuracy is calculated by the Python script accuracy.py in the Python folder.
It uses col.txt and row.txt, which contain the mappings between features and feature ids and between document ids and document types.
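
As an illustration of the entry format in result.txt, the sketch below splits one entry into its feature ids and document ids. The class ResultLineSketch is hypothetical and is not part of the repository. It assumes one entry per line, with ids separated by whitespace as in the sample entry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one "feature ids : document ids" entry.
class ResultLineSketch {
    final List<Integer> featureIds;
    final List<Integer> documentIds;

    ResultLineSketch(List<Integer> featureIds, List<Integer> documentIds) {
        this.featureIds = featureIds;
        this.documentIds = documentIds;
    }

    // Splits an entry at the first ":" and parses the ids on each side.
    static ResultLineSketch parse(String line) {
        String[] halves = line.split(":", 2);
        if (halves.length != 2) {
            throw new IllegalArgumentException("Missing ':' in entry: " + line);
        }
        return new ResultLineSketch(parseIds(halves[0]), parseIds(halves[1]));
    }

    // Parses whitespace-separated integer ids, ignoring empty tokens.
    private static List<Integer> parseIds(String part) {
        List<Integer> ids = new ArrayList<>();
        for (String token : part.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                ids.add(Integer.parseInt(token));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        ResultLineSketch entry = parse("671 5260 7824 : 462 1458 2820 3685");
        // prints: features=[671, 5260, 7824] documents=[462, 1458, 2820, 3685]
        System.out.println("features=" + entry.featureIds + " documents=" + entry.documentIds);
    }
}
```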