Data Mining Complete Lab Manual - DRSNR

The document outlines a data mining lab course focused on using the WEKA toolkit for various data mining tasks, including data pre-processing, classification, clustering, and association rule mining. It details procedures for installing WEKA, understanding the ARFF file format, and performing specific algorithms such as Apriori and J48. The lab includes hands-on activities with datasets to explore data attributes, handle missing values, and visualize results.


Subject: DATA MINING LAB

Class : VI CSE-C

Faculty : Dr S. NageswaraRao
INDEX

1 Explore WEKA Data Mining/Machine Learning Toolkit

2 i) Study the .arff file format


ii) Explore the available data sets in WEKA. Load and observe each
dataset.
3 Perform data pre-processing tasks
i) Handling missing values
ii) Applying normalization on numeric attributes
4 Load each dataset into Weka and run Apriori algorithm
5 Demonstrate performing classification on data sets

6 Extract if-then rules from the decision tree generated by the classifier

7 Load labelled dataset into Weka and perform Naive-bayes classification and k-
Nearest Neighbour classification. Interpret the results obtained.

8 i) Cluster the given data set using k-means clustering algorithm


ii) Explore visualization features of Weka to visualize the clusters

9 i) Apply the hierarchical and grid-based clustering techniques on the given data
set
ii) Explore visualization features of Weka to visualize the clusters
10 Apply PCA (Principal Component Analysis) dimensionality reduction technique
on the given data set.
11 Demonstrate performing Regression on data sets

12 Credit Risk Assessment – The German Credit Data



1) Explore WEKA Data Mining/Machine Learning Toolkit

(i). Downloading and/or installation of WEKA data mining toolkit

Procedure:
1. Go to the Weka website, http://www.cs.waikato.ac.nz/ml/weka/, and download the software.
On the left-hand side, click on the link that says download.
2. Select the appropriate link corresponding to the version of the software based on your
operating system and whether or not you already have a Java VM running on your machine (if you
don't know what a Java VM is, then you probably don't).
3. The link will forward you to a site where you can download the software from a mirror site.
Save the self-extracting executable to disk and then double click on it to install Weka. Answer
yes or next to the questions during the installation.
4. Click yes to accept the Java agreement if necessary. After you install the program Weka
should appear on your start menu under Programs (if you are using Windows).
5. To run Weka from the Start menu, select Programs, then Weka. You will see the Weka GUI
Chooser. Select Explorer. The Weka Explorer will then launch.

Data Mining Lab Dept of CSE KSRMC

(ii). Understand the features of WEKA toolkit such as Explorer, Knowledge


Flow interface, Experimenter, command-line interface.

The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching
Weka's main GUI applications and supporting tools. If one prefers an MDI ("multiple document
interface") appearance, then this is provided by an alternative launcher called "Main"
(class weka.gui.Main).
The GUI Chooser consists of four buttons, one for each of the four major Weka applications,
and four menus.

The buttons can be used to start the following applications:


Explorer - An environment for exploring data with WEKA
a) Click on the "Explorer" button to bring up the Explorer window.
b) Make sure the "Preprocess" tab is highlighted.
c) Open a new file by clicking on "Open file..." and choosing a file with the ".arff"
extension from the "data" directory.
d) Attributes appear in the window below.
e) Click on the attributes to see the visualization on the right.
f) Click "Visualize All" to see them all.

Experimenter - An environment for performing experiments and conducting statistical tests
between learning schemes.
a) Experimenter is for comparing results.
b) Under the "Setup" tab click "New".
c) Click on "Add new..." under the "Datasets" frame. Choose a couple of .arff files from
the "data" directory, one at a time.
d) Click on "Add new..." under the "Algorithms" frame. Choose several algorithms, one at a time,
by clicking "OK" in the window and then "Add new..." again.
e) Under the "Run" tab click "Start".


f) Wait for WEKA to finish.

g) Under the "Analyse" tab click on "Experiment" to see the results.

Knowledge Flow - This environment supports essentially the same functions as the Explorer but
with a drag-and-drop interface. One advantage is that it supports incremental learning.
SimpleCLI - Provides a simple command-line interface that allows direct execution of WEKA
commands for operating systems that do not provide their own command line interface.
(iii).Navigate the options available in the WEKA (ex. Select attributes panel, Preprocess
panel, classify panel, Cluster panel, Associate panel and Visualize panel)

When the Explorer is first started only the first tab is active; the others are greyed out. This is
because it is necessary to open (and potentially pre-process) a data set before starting to explore
the data.
The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
Once the tabs are active, clicking on them flicks between different screens, on which the
respective actions can be performed. The bottom area of the window (including the status box,
the log button, and the Weka bird) stays visible regardless of which section you are in.

1. Preprocessing

Loading Data:

The first four buttons at the top of the preprocess section enable you to load data into WEKA:


1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local file
system.
2. Open URL.... Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB.... Reads data from a database. (Note that to make this work you might have to edit
the file in weka/experiment/DatabaseUtils.props.)
4. Generate.... Enables you to generate artificial data from a variety of DataGenerators.
Using the Open file... button you can read files in a variety of formats:
WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files
typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names
extension, and serialized Instances objects a .bsi extension.
2. Classification:

Selecting a Classifier

At the top of the classify section is the Classifier box. This box has a text field that gives the
name of the currently selected classifier and its options. Clicking on the text box with the left
mouse button brings up a GenericObjectEditor dialog box, just the same as for filters, that you
can use to configure the options of the current classifier. With a right click (or Alt+Shift+left
click) you can once again copy the setup string to the clipboard or display the properties in a
GenericObjectEditor dialog box. The Choose button allows you to choose one of the classifiers
that are available in WEKA.
Test Options


The result of applying the chosen classifier will be tested according to the options that are set by
clicking in the Test options box. There are four test modes:
1. Use training set: The classifier is evaluated on how well it predicts the class of the instances
it was trained on.
2. Supplied test set: The classifier is evaluated on how well it predicts the class of a set of
instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to
choose the file to test on.
3. Cross-validation: The classifier is evaluated by cross-validation, using the number of folds
that are entered in the Folds text field.
4. Percentage split: The classifier is evaluated on how well it predicts a certain percentage of the
data which is held out for testing. The amount of data held out depends on the value entered in
the % field.
3. Clustering:

Cluster Modes:

The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first
three options are the same as for classification: Use training set, Supplied test set and
Percentage split.
4. Associating:


Setting Up
This panel contains schemes for learning association rules, and the learners are chosen and
configured in the same way as the clusterers, filters, and classifiers in the other panels.

5. Selecting Attributes:

Searching and Evaluating


Attribute selection involves searching through all possible combinations of attributes in the data
to find which subset of attributes works best for prediction. To do this, two objects must be set
up: an attribute evaluator and a search method. The evaluator determines what method is used to
assign a worth to each subset of attributes. The search method determines what style of search is
performed.
6. Visualizing:

WEKA's visualization section allows you to visualize 2D plots of the current relation.

2.i) Study the .arff file format

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list
of instances sharing a set of attributes. ARFF files were developed by the Machine Learning
Project at the Department of Computer Science of The University of Waikato for use with
the Weka machine learning software.

Overview

ARFF files have two distinct sections. The first section is the Header information, which is
followed by the Data information.

The Header of the ARFF file contains the name of the relation, a list of the attributes (the
columns in the data), and their types. An example header on the standard IRIS dataset looks like
this:

% 1. Title: Iris Plants Database


%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%[email protected])
% (c) Date: July, 1988

%
@RELATION iris

@ATTRIBUTE sepallength NUMERIC


@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

The Data of the ARFF file looks like the following:

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments.
The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.
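To make the format concrete, a header and data section like the one above can be read with a few lines of plain Python. This is an illustrative stdlib-only sketch that handles only the subset of ARFF shown above (comments, @RELATION, @ATTRIBUTE, @DATA, comma-separated rows), not Weka's full parser:

```python
def parse_arff(text):
    """Parse a minimal ARFF string into (relation, attributes, data rows)."""
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):
            continue  # skip blank lines and comments
        lower = line.lower()
        if lower.startswith('@relation'):
            relation = line.split(None, 1)[1]
        elif lower.startswith('@attribute'):
            name, atype = line.split(None, 2)[1:3]
            attributes.append((name, atype))
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            data.append(line.split(','))
    return relation, attributes, data

sample = """
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,Iris-setosa
4.9,Iris-setosa
"""
rel, attrs, rows = parse_arff(sample)
print(rel, len(attrs), len(rows))  # iris 2 2
```

Note the declarations are matched case-insensitively here, mirroring the rule stated above.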

2.ii) Explore the available data sets in WEKA

There are 23 different datasets available in Weka (under C:\Program Files\Weka-3-6\data) by
default for testing purposes. All the datasets are in .arff format.


Load a data set (ex. Weather dataset, Iris dataset, etc.)

Procedure:
1. Open the weka tool and select the explorer option.
2. New window will be opened which consists of different options (Preprocess, Association etc.)
3. In the Preprocess tab, click the "Open file" option.
4. Go to C:\Program Files\Weka-3-6\data to find the different existing .arff datasets.
5. Click on any dataset to load the data; it will then be displayed as shown below.

Load each dataset and observe the following:


Here we have taken IRIS.arff dataset as sample for observing all the below things.

List the attribute names and their types

There are 5 attributes and their datatypes present in the above loaded dataset (IRIS.arff):
sepallength – Numeric
sepalwidth – Numeric
petallength – Numeric
petalwidth – Numeric
class – Nominal
i. Number of records in each dataset

There are total 150 records (Instances) in dataset (IRIS.arff).

ii. Identify the class attribute (if any)

There is one class attribute which consists of 3 labels. They are:


1. Iris-setosa
2. Iris-versicolor
3. Iris-virginica

iii. Plot Histogram


iv. Determine the number of records for each class.

There is one class attribute (150 records) which consists of 3 labels. They are shown below
1. Iris-setosa - 50 records
2. Iris-versicolor – 50 records
3. Iris-virginica – 50 records

v. Visualize the data in various dimensions


3) Perform data pre-processing tasks


o Handling missing values
o Applying normalization on numeric attributes

Handling missing values


Description: Missing values of numeric attributes are replaced with the mean value; missing
values of nominal attributes are replaced with the most frequently occurring value (the mode).

In Weka this is implemented with the following steps:


1) Select the data set called "cpu.with.vendor"
2) Select choose—unsupervised—attribute—ReplaceWithMissingValues; it introduces
some missing values in each attribute.
3) The missing values can then be handled using either of the options
 choose—unsupervised—attribute—ReplaceMissingValues
 choose—unsupervised—attribute—ReplaceMissingWithUserConstant
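The replacement rule described above (mean for numeric, most frequent value for nominal) can be sketched in plain Python. This is an illustrative stdlib-only sketch, not Weka's implementation; the toy table and the use of None for a missing value are assumptions for the example:

```python
from collections import Counter

def replace_missing(rows, numeric_cols):
    """Replace None values: column mean for numeric columns, mode for the rest."""
    cols = list(zip(*rows))
    filled = []
    for i, col in enumerate(cols):
        present = [v for v in col if v is not None]
        if i in numeric_cols:
            fill = sum(present) / len(present)        # mean of observed values
        else:
            fill = Counter(present).most_common(1)[0][0]  # most frequent value
        filled.append([fill if v is None else v for v in col])
    return [list(r) for r in zip(*filled)]

data = [[1.0, 'a'], [None, 'b'], [3.0, None], [4.0, 'a']]
out = replace_missing(data, numeric_cols={0})
print(out)  # numeric gap -> column mean, nominal gap -> most frequent value
```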

Applying normalization on numeric attributes

Description: In normalization, all numeric attribute values are rescaled into the default range of
[0, 1].

In Weka this is implemented with the following steps:

1) Select the data set called "cpu"

2) Select choose—unsupervised—attribute—Normalize.
It normalizes all attribute values into the default range of 0 to 1; this range can be changed by
double-clicking on the Normalize filter text and editing its scale (-S) and translation (-T)
options.
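The rescaling the Normalize filter performs can be sketched as follows (an illustrative stdlib-only sketch; the sample values are made up):

```python
def normalize(values, low=0.0, high=1.0):
    """Min-max normalize a numeric column into [low, high]; Weka's default is [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [low for _ in values]  # constant column maps to the lower bound
    return [low + (v - lo) * (high - low) / span for v in values]

out = normalize([125, 29, 29, 26])  # illustrative values
print(out)  # max -> 1.0, min -> 0.0, everything else in between
```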

4) LOAD EACH DATASET INTO WEKA AND RUN APRIORI ALGORITHM


Step 1: Open the data file in Weka Explorer. It is presumed that the required data fields have been
discretized. In this example it is the age attribute.

Step 2: Clicking on the Associate tab will bring up the interface for the association rule algorithms.

Step 3: We will use the Apriori algorithm. This is the default algorithm.

Step 4: In order to change the parameters for the run (e.g., support, confidence) we click on the
text box immediately to the right of the Choose button.
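The frequent-itemset stage of Apriori can be sketched in plain Python (an illustrative stdlib-only sketch of support counting and candidate generation, not Weka's implementation; the toy transactions are made up):

```python
from itertools import combinations

def apriori_frequent(transactions, min_support):
    """Return frequent itemsets as a dict {frozenset: support}."""
    n = len(transactions)
    tx = [frozenset(t) for t in transactions]
    items = {i for t in tx for i in t}
    current = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # count the support of every candidate
        counts = {c: sum(1 for t in tx if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # join surviving k-itemsets into (k+1)-itemset candidates
        keys = list(survivors)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        k += 1
    return frequent

tx = [{'bread', 'milk'}, {'bread', 'butter'}, {'milk', 'butter'},
      {'bread', 'milk', 'butter'}]
freq = apriori_frequent(tx, min_support=0.5)
print(sorted(tuple(sorted(s)) for s in freq))
```

Association rules are then derived from these itemsets by comparing the supports of an itemset and its subsets against the confidence threshold.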


5) Demonstrate performing classification on data sets


Procedure for J48:
1. Load the dataset (contact-lenses.arff) into the Weka tool
2. Go to the Classify tab; in the left-hand navigation bar we can see different classification
algorithms under the trees section.
3. Select the J48 algorithm; in More options select the output entropy evaluation
measures and click on the Start option.
4. Then we will get the classifier output, entropy values and Kappa statistic as represented below.
5. In the below screenshot, we can run classifiers with different test options (Cross-validation, Use
training set, Percentage split, Supplied test set).

A. Use training set: The classifier is evaluated on how well it predicts the class of the instances it
was trained on.
B. Supplied test set: The classifier is evaluated on how well it predicts the class of a set of instances
loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test
on.


C. Cross-validation: The classifier is evaluated by cross-validation, using the number of folds that
are entered in the Folds text field.
D. Percentage split: The classifier is evaluated on how well it predicts a certain percentage of the
data which is held out for testing. The amount of data held out depends on the value entered in the %
field.
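J48 chooses its splits using entropy-based measures; the underlying computation can be sketched as follows (an illustrative stdlib-only sketch on a made-up weather-style toy set, not Weka's code):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Information gain from splitting `rows` on the attribute at attr_index."""
    n = len(rows)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(sub) / n * entropy(sub) for sub in split.values())
    return entropy(labels) - remainder

# toy data: one nominal attribute [outlook], label = play?
rows = [['sunny'], ['sunny'], ['overcast'], ['rain']]
labels = ['no', 'no', 'yes', 'yes']
print(round(info_gain(rows, 0, labels), 3))  # 1.0: outlook separates the classes perfectly
```

At each node, J48 (C4.5) picks the attribute with the best such score (it actually uses gain ratio, a normalized variant of this gain).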

6) Extract if-then rules from the decision tree generated by the classifier

Procedure:
1. Load the dataset (Iris-2D.arff) into the Weka tool
2. Go to the Classify tab; in the left-hand navigation bar we can see the different classification
algorithms, with DecisionStump under the trees section.
3. Select the DecisionStump algorithm and click on the Start option with the "Use training set"
test option enabled.
4. Then we will get the detailed accuracy by class, consisting of F-measure, TP rate, FP rate,
Precision and Recall values, and the Confusion Matrix as represented below.
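The kind of one-level if-then rules such a classifier yields can be sketched in plain Python. This is a simplified OneR-style illustration (one rule per attribute value; Weka's DecisionStump splits on a single value), with a made-up outlook attribute and labels:

```python
from collections import Counter

def one_level_rules(rows, attr_index, labels):
    """Fit a one-level rule set on a nominal attribute: value -> majority class."""
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[attr_index], []).append(label)
    return {value: Counter(lbls).most_common(1)[0][0]
            for value, lbls in branches.items()}

rows = [['sunny'], ['sunny'], ['overcast'], ['rain'], ['rain']]
labels = ['no', 'no', 'yes', 'yes', 'no']
rules = one_level_rules(rows, 0, labels)
for value, cls in sorted(rules.items()):
    print(f"IF outlook = {value} THEN class = {cls}")
```

Reading if-then rules off a full J48 tree works the same way: each root-to-leaf path becomes one rule whose conditions are the tests along the path.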

7) Load labelled dataset into Weka and perform Naive-bayes classification and k-Nearest
Neighbour classification. Interpret the results obtained.

Procedure for Naïve-Bayes:


1. Load the dataset (Iris-2D.arff) into the Weka tool
2. Go to the Classify tab; in the left-hand navigation bar we can see different classification
algorithms under the bayes section.
3. Select the NaiveBayes algorithm and click on the Start option with the "Use training set" test
option enabled.
4. Then we will get the detailed accuracy by class, consisting of F-measure, TP rate, FP rate,
Precision and Recall values, and the Confusion Matrix as represented below.
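The computation behind Naive Bayes (class prior times the product of per-attribute likelihoods, here Gaussian for numeric attributes) can be sketched as follows. This is a stdlib-only illustration on made-up Iris-like values, not Weka's NaiveBayes:

```python
from math import pi, sqrt, exp, prod

def nb_fit(X, y):
    """Estimate prior, per-attribute mean and variance for each class."""
    model = {}
    for cls in set(y):
        rows = [x for x, label in zip(X, y) if label == cls]
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        varis = [max(sum((v - m) ** 2 for v in c) / len(c), 1e-9)  # clamp zero variance
                 for c, m in zip(cols, means)]
        model[cls] = (len(rows) / len(X), means, varis)
    return model

def nb_predict(model, x):
    def pdf(v, m, var):  # Gaussian density
        return exp(-(v - m) ** 2 / (2 * var)) / sqrt(2 * pi * var)
    scores = {cls: prior * prod(pdf(v, m, s) for v, m, s in zip(x, means, varis))
              for cls, (prior, means, varis) in model.items()}
    return max(scores, key=scores.get)

# made-up [petallength, petalwidth] values
X = [[1.4, 0.2], [1.3, 0.3], [4.7, 1.4], [4.5, 1.5]]
y = ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor']
model = nb_fit(X, y)
print(nb_predict(model, [1.5, 0.25]))  # Iris-setosa
```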


Procedure for K-Nearest Neighbour (IBK):


1. Load the dataset (Iris-2D.arff) into the Weka tool


2. Go to the Classify tab; in the left-hand navigation bar we can see different classification
algorithms under the lazy section.
3. Select the K-Nearest Neighbour (IBk) algorithm and click on the Start option with the "Use
training set" test option enabled.
4. Then we will get the detailed accuracy by class, consisting of F-measure, TP rate, FP rate,
Precision and Recall values, and the Confusion Matrix as represented below.
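The idea behind IBk, a majority vote among the k nearest training instances, can be sketched in a few lines (an illustrative stdlib-only sketch with made-up Iris-like points, not Weka's IBk):

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

X = [[1.4, 0.2], [1.3, 0.3], [1.5, 0.2], [4.7, 1.4], [4.5, 1.5], [4.9, 1.5]]
y = ['Iris-setosa'] * 3 + ['Iris-versicolor'] * 3
print(knn_predict(X, y, [1.45, 0.25], k=3))  # Iris-setosa
```

Because there is no training phase beyond storing the data, Weka files this under the "lazy" section.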

8) i) Cluster the given data set using k-means clustering algorithm


ii) Explore visualization features of Weka to visualize the clusters

k-means clustering

1. Load the dataset (Iris.arff) into the Weka tool

2. Go to the Cluster tab, where we can see the different clustering algorithms.
3. Select the SimpleKMeans algorithm and click on the Start option with the "Use training set" test
option enabled.
4. Then we will get the sum of squared errors, centroids, number of iterations and clustered instances
as represented below.
5. If we right-click on SimpleKMeans in the result list, we will get more options, in which "Visualize
cluster assignments" should be selected for getting the cluster visualization as shown below.
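The iteration SimpleKMeans runs (Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids) can be sketched in plain Python. This is an illustrative stdlib-only sketch with made-up 2D points; Weka seeds the centroids randomly, while here the first k points are used:

```python
from math import dist
from statistics import mean

def kmeans(points, k, iters=20):
    """Lloyd's algorithm: returns (centroids, assignments, sum of squared errors)."""
    centroids = points[:k]  # naive seeding for the sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        centroids = [tuple(map(mean, zip(*c))) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    assign = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
    sse = sum(dist(p, centroids[a]) ** 2 for p, a in zip(points, assign))
    return centroids, assign, sse

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, assign, sse = kmeans(pts, k=2)
print(assign, round(sse, 3))
```

The "sum of squared errors" Weka reports is the same SSE quantity computed here: the total squared distance of every instance to its cluster centroid.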


Explore visualization features of Weka to visualize the clusters

 If we right-click on SimpleKMeans in the result list, we will get more options, in which
"Visualize cluster assignments" should be selected for getting the cluster visualization as shown
below.


 In that cluster visualization we have different features to explore: changing the X-axis,
Y-axis, Colour, Jitter and Select Instance (Rectangle, Polygon and Polyline) options gives
different views of the clusters.

 As shown in the above screenshot, all the dataset (Iris.arff) tuples are represented on the X-axis,
and similarly for the Y-axis. Each cluster gets a different colour; in the above figure there are
two clusters, represented in blue and red. In Select Instance we can choose different shapes for
selecting a clustered area; in the below screenshot the rectangle shape is selected.


 By this visualization feature we can observe different clustering outputs for a dataset by
changing the X-axis, Y-axis, Colour and Jitter options.


9) Apply the hierarchical and grid-based clustering techniques on the given data set

1. Load the dataset (Iris.arff) into the Weka tool

2. Go to the Cluster tab, where we can see the different clustering algorithms.
3. Select the HierarchicalClusterer algorithm and click on the Start option with the "Use training
set" test option enabled.
4. Then we will get the clustered instances and the cluster hierarchy output as represented below.
5. If we right-click on HierarchicalClusterer in the result list, we will get more options, in which
"Visualize cluster assignments" should be selected for getting the cluster visualization as shown
below.
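The agglomerative process HierarchicalClusterer performs can be sketched with single linkage: start with one cluster per point and repeatedly merge the two closest clusters. This is an illustrative stdlib-only sketch on made-up 2D points; Weka supports several link types (single, complete, average, and so on):

```python
from math import dist

def single_linkage(points, k):
    """Agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair of clusters
    return clusters

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(single_linkage(pts, 2))
```

Recording the order and distance of the merges yields the dendrogram that hierarchical clusterers display.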


10) Apply PCA (Principal Component Analysis) dimensionality reduction technique on the given
data set.

1. Load the dataset (Iris.arff) into weka tool


2. Go to Filter -> unsupervised -> attribute -> PrincipalComponents
3. Click on the PrincipalComponents filter text to open its options, in which the maximum number
of attributes can be set to 2
4. Visualize the reduced dimensions by clicking the Visualize option in the upper right
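What the PrincipalComponents filter computes, the direction of maximum variance in the centered data, can be sketched with power iteration on the covariance matrix. This is a stdlib-only illustration on made-up 2D rows; Weka uses a full eigendecomposition and keeps as many components as requested:

```python
def pca_first_component(rows, iters=200):
    """First principal component (unit vector) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(r, means)] for r in rows]
    # sample covariance matrix (d x d)
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]  # converges to the dominant eigenvector
    return v

rows = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]]
pc1 = pca_first_component(rows)
print([round(x, 3) for x in pc1])
```

Projecting each centered row onto this vector gives the first reduced dimension; repeating on the residual gives the next component.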


11) Demonstrate performing Regression on data sets
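A minimal example of what regression does, fitting y = a*x + b by ordinary least squares on one numeric attribute, can be sketched as follows (an illustrative stdlib-only sketch with made-up data; Weka's LinearRegression handles multiple attributes and attribute selection):

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx  # the fitted line passes through the mean point
    return a, b

a, b = linear_fit([1, 2, 3, 4], [3, 5, 7, 9])  # data generated from y = 2x + 1
print(a, b)  # 2.0 1.0
```

In the Weka Explorer this corresponds to choosing a numeric class attribute in the Classify tab and selecting a scheme such as LinearRegression under the functions section.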
