Lab Manual
of
Data Mining & Data Warehousing
(CSN447)
MASTER OF COMPUTER APPLICATIONS
Session 2024-25
SCHOOL OF COMPUTING
DIT University
Submitted to:
Dr. R. K. Saini
School of Computing
DIT University, Dehradun

Submitted by:
Sawan
1000026853
MCA (P1)
EXPERIMENT 1: Exploring the WEKA Tool
WHAT IS WEKA?
WEKA is an open-source software suite that provides tools for data preprocessing, implementations of several machine learning algorithms, and visualization facilities, so that you can develop machine learning techniques and apply them to real-world data mining problems. What WEKA offers is summarized in the following diagram.
STEP 1: Installing the WEKA tool in a virtual machine using Internet Explorer.
The WEKA GUI Chooser application will start, and you will see the following:
The GUI Chooser application allows you to run five different types of applications as listed here:
· Explorer
· Experimenter
· KnowledgeFlow
· Workbench
· Simple CLI
WEKA TOOL INTERFACE
STEP 2: Opening the Weka Explorer and Loading a Dataset Using the Following Path:
When you click the Explorer button in the Applications selector, the following screen opens.
On the top, you will see several tabs as listed here:
· Preprocess
· Classify
· Cluster
· Associate
· Select Attributes
· Visualize
Click on the Open file... button. A directory navigator window opens, as shown in the following screen; follow the path below to load the data.
C:\Program Files\Weka-3-8-6\data
STEP 3: Visualizing The Given Data Graphically
STEP 4: Every Attribute, Along With Its Graph, Supports Some Conclusion
For example, for the duration attribute of the labour dataset, the graph and conclusion are as follows:
Here we can see, from the graph as well as from the values precomputed by the Weka tool, that:
The mean of the duration is: 2.161
The standard deviation is: 0.707
Wage increase in the second year, its graph, and conclusion:
Here we can see, from the graph as well as from the values precomputed by the Weka tool, that:
The mean of the wage increase in the second year is: 3.972
The standard deviation is: 1.164
Weekly working hours, its graph, and conclusion:
Here we can see, from the graph as well as from the values precomputed by the Weka tool, that:
The mean of the weekly working hours is: 38.039
The standard deviation is: 2.506
Wage increase in the third year, its graph, and conclusion:
Here we can see, from the graph as well as from the values precomputed by the Weka tool, that:
The mean of the wage increase in the third year is: 3.913
The standard deviation is: 1.304
EXPERIMENT 2: Creating a New ARFF File
Code:
@relation Student
@attribute stud_id numeric
@attribute stud_name string
@attribute stud_age numeric
@attribute stud_dept {CSE, IT, BCA, MCA}
@attribute stud_gender {male, female}
@attribute stud_marks numeric
@attribute stud_city string
@attribute stud_Dob date "dd-MM-yyyy"
@data
1, Riya, 22, CSE, female, 90, Shamli, 22-03-2003
2, Tanish, 21, IT, female, 91, Aligarh, 17-01-2004
3, Parul, 22, MCA, female, 89, Kota, 05-10-2003
4, Shreyashi, 25, CSE, female, 87, Kotdwar, 13-11-1997
5, Ayesha, 23, IT, female, 85, Mumbai, 10-10-2000
6, Tanu, 21, CSE, female, 88, Jaipur, 15-05-2002
7, Siddharth, 24, IT, male, 86, Pune, 20-08-1999
8, Shreya, 20, MCA, female, 92, Chennai, 25-12-2003
9, Rahul, 22, CSE, male, 89, Hyderabad, 03-07-2002
10, Neha, 23, IT, female, 84, Bengaluru, 18-09-2001
Images:
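To check that the file is well-formed, it can be loaded in the Explorer's Preprocess tab, or read programmatically. A minimal Java sketch, assuming weka.jar (Weka 3.8) is on the classpath and the file above is saved as Student.arff (the file name is an assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadStudent {
    public static void main(String[] args) throws Exception {
        // Read the ARFF file created above; adjust the path to your machine
        Instances data = DataSource.read("Student.arff");
        System.out.println(data.toSummaryString()); // per-attribute statistics
        System.out.println("Instances loaded: " + data.numInstances());
    }
}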
EXPERIMENT 3: Data Pre-Processing Techniques on a Dataset
Pre-processing involves the following operations; a combined Java sketch covering all three appears after the list.
1. Converting nominal attributes to binary, and numeric attributes to nominal (in the form of 0 or 1).
Output:
2. Detecting missing values and replacing them with a user-defined constant or a system-generated value.
Output:
3. Detecting outliers and removing them using the interquartile range.
Output:
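The three operations above can also be scripted with Weka's filter classes. A minimal sketch, assuming weka.jar (Weka 3.8) is on the classpath and the Student.arff file from Experiment 2 is used (the file name and choice of dataset are assumptions; Weka also offers a ReplaceMissingWithUserConstant filter for the user-constant variant):

import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.InterquartileRange;
import weka.filters.unsupervised.attribute.NominalToBinary;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class PreprocessDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Student.arff"); // path is an assumption

        // 1. Convert nominal attributes into 0/1 indicator attributes
        NominalToBinary n2b = new NominalToBinary();
        n2b.setInputFormat(data);
        Instances binary = Filter.useFilter(data, n2b);
        System.out.println(binary.toSummaryString());

        // 2. Replace missing values with the attribute mean (numeric) or mode (nominal)
        ReplaceMissingValues rmv = new ReplaceMissingValues();
        rmv.setInputFormat(data);
        Instances filled = Filter.useFilter(data, rmv);

        // 3. Flag outliers via the interquartile range; the filter appends
        //    nominal "Outlier" and "ExtremeValue" attributes to each instance
        InterquartileRange iqr = new InterquartileRange();
        iqr.setInputFormat(filled);
        Instances flagged = Filter.useFilter(filled, iqr);

        // Drop the instances whose "Outlier" value is "yes" (the last label)
        RemoveWithValues rwv = new RemoveWithValues();
        rwv.setOptions(Utils.splitOptions(
                "-C " + (flagged.attribute("Outlier").index() + 1) + " -L last"));
        rwv.setInputFormat(flagged);
        Instances cleaned = Filter.useFilter(flagged, rwv);
        System.out.println(cleaned.numInstances() + " instances after outlier removal");
    }
}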
EXPERIMENT 4: Create a cube and illustrate the following OLAP operations:
1) Roll-up 2) Drill-down 3) Slice 4) Dice 5) Pivot
Roll-up:
(a) Concept: This operation aggregates data to a higher level. Essentially, it
reduces the granularity of the data.
(b) Example: You may roll up from "Month" to "Quarter" or "Region" to
"Country."
(c) Visual: Imagine a cube with dimensions like Time (Month), Product
(Category), and Region. Rolling up from "Month" to "Quarter" would
consolidate all the monthly data into quarterly data.
Drill-down:
(a) Concept: This is the opposite of rollup. It allows you to go from higher
levels of aggregation to more detailed data.
(b) Example: You drill down from "Country" to "Region" or from "Year" to
"Month."
(c) Visual: In a cube, this would involve expanding a higher-level value (like
"Year") into more detailed values (like "Month").
Slice:
(a) Concept: The slice operation is used to view a subset of the cube by
fixing one of the dimensions to a specific value.
(b) Example: You might slice the cube to look at data for a specific year
(e.g., 2023) while keeping other dimensions, like product and region,
unchanged.
(c) Visual: Imagine you have a cube with three dimensions: Time
(Month), Product (Category), and Region. If you slice by Time =
"2023," you'll have a 2D view of the data for just that year.
Dice:
(a) Concept: This operation is similar to slicing, but you select specific
values from multiple dimensions, creating a smaller cube.
(b) Example: You might dice the cube to focus on a specific "Region"
and "Product" while selecting all months.
(c) Visual: If you dice by "Region = North America" and "Product = Laptops," the result is a smaller cube that only has data for those selections, keeping time (e.g., month) as a dimension.
Pivot:
(a) Concept: The pivot operation rotates the data, changing the
orientation of the cube to view it from different perspectives.
(b) Example: If you have a table with regions as rows and months as
columns, pivoting would flip the rows and columns, making months
rows and regions columns.
(c) Visual: Pivoting would involve rotating the cube so you can change
how dimensions are organized (rows vs columns).
Code:
@relation sales_cube
@attribute Month {Jan, Feb, Mar}
@attribute City {Delhi, Mumbai, Bangalore}
@attribute Product {Laptop, Phone, Tablet}
@attribute Sales numeric
@data
Jan,Delhi,Laptop,120
Jan,Delhi,Phone,100
Jan,Mumbai,Laptop,150
Feb,Delhi,Laptop,130
Feb,Mumbai,Phone,170
Feb,Bangalore,Tablet,90
Mar,Delhi,Tablet,200
Mar,Mumbai,Phone,180
Mar,Bangalore,Laptop,160
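To make roll-up concrete, the sketch below collapses the City and Product dimensions of the cube above and totals Sales per Month; from the data, the expected totals are Jan: 370, Feb: 390, Mar: 540. A minimal Java sketch, assuming weka.jar is on the classpath and the cube is saved as sales_cube.arff (the file name is an assumption):

import java.util.LinkedHashMap;
import java.util.Map;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RollupDemo {
    public static void main(String[] args) throws Exception {
        Instances cube = DataSource.read("sales_cube.arff"); // path is an assumption
        int month = cube.attribute("Month").index();
        int sales = cube.attribute("Sales").index();
        // Roll-up: aggregate away City and Product, keeping only the Month dimension
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Instance row : cube) {
            totals.merge(row.stringValue(month), row.value(sales), Double::sum);
        }
        totals.forEach((m, t) -> System.out.println(m + ": " + t));
    }
}

A slice (e.g., Month = Jan) or a dice (e.g., City = Delhi and Product = Laptop) would instead keep only the rows matching the fixed dimension values.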
Output:
EXPERIMENT 5: Demonstrating Regression on a Dataset
Code:
@relation house_prices
@attribute size numeric
@attribute bedrooms numeric
@attribute price numeric
@data
1000, 3, 200000
1200, 3, 250000
1500, 4, 300000
800, 2, 180000
950, 2, 190000
1400, 4, 280000
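In the Explorer, this regression is run by choosing functions > LinearRegression in the Classify tab. Equivalently, via the Weka Java API, a minimal sketch, assuming weka.jar is on the classpath and the data above is saved as house_prices.arff (the file name is an assumption):

import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RegressionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("house_prices.arff"); // path is an assumption
        data.setClassIndex(data.attribute("price").index());   // predict price
        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);
        System.out.println(lr); // prints the fitted model: price as a linear function of size and bedrooms
    }
}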
Output:
EXPERIMENT 6: Implementation of Decision Tree and Random Forest Tree
Induction
Weka's J48 classifier is an implementation of C4.5, a commonly used decision tree algorithm.
Steps to Apply Decision Tree in Weka:
1. Load the dataset:
o Open Weka.
o Click on Explorer.
o Click on Open file and select the dataset you want to use (e.g., .arff or .csv format).
2. Choose the J48 classifier:
o In the "Classify" tab, click the Choose button.
o Under trees, select J48 (this is Weka's implementation of the C4.5 algorithm).
3. Set the options (optional):
o Click on the text next to J48 to open the options window.
o Here, you can change parameters such as:
▪ confidenceFactor (default: 0.25): the confidence level used for pruning.
▪ minNumObj (default: 2): the minimum number of objects (instances) at a leaf node.
▪ unpruned: if checked, disables pruning (producing a fully grown tree).
For example:
o To use a confidence factor of 0.1, type -C 0.1 in the options field.
4. Train the model:
o Click the Start button to build the decision tree model.
5. View the output:
o Once the model has finished building, Weka shows the decision tree structure in the result area, along with accuracy and other performance metrics such as precision, recall, and F1-score. A scripted equivalent is sketched below.
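A minimal J48 sketch via the Weka Java API, assuming weka.jar (Weka 3.8) is on the classpath and Weka's bundled weather.nominal.arff is used (the file path is an assumption):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);             // class = play
        J48 tree = new J48();
        tree.setOptions(Utils.splitOptions("-C 0.25 -M 2"));      // default pruning options
        tree.buildClassifier(data);
        System.out.println(tree);                                 // the induced decision tree
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));   // 10-fold cross-validation
        System.out.println(eval.toSummaryString());               // accuracy and error measures
    }
}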
Using Random Forest in Weka
A Random Forest is an ensemble method that builds multiple decision trees and combines their predictions.
Weka has an implementation called Random Forest.
Steps to Apply Random Forest in Weka:
1. Load the dataset:
o Similar to the Decision Tree, open Weka, click on Explorer, and load your dataset.
2. Choose the Random Forest classifier:
o In the "Classify" tab, click the Choose button.
o Under trees, select RandomForest.
3. Set the options (optional):
o Click on the text next to RandomForest to open the options window.
o The default parameters include:
▪ numIterations (-I, default: 100): the number of trees to build.
▪ maxDepth (default: 0, meaning no limit on tree depth).
▪ seed: the random seed, for reproducibility.
▪ bagSizePercent (default: 100): the percentage of the training set used for building each tree (via bootstrapping).
For example:
o To use 200 trees, set -I 200.
o To limit the depth of trees to 10, set -depth 10.
4. Train the model:
o Click the Start button to build the random forest model.
5. View the output:
o After training, Weka displays performance metrics such as accuracy and the confusion matrix. You can also examine the structure of the individual trees if needed. A scripted equivalent is sketched below.
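A matching Java sketch for the random forest, under the same assumptions (weka.jar on the classpath, file path adjusted to your dataset):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);
        RandomForest rf = new RandomForest();
        rf.setOptions(Utils.splitOptions("-I 200 -depth 10")); // 200 trees, depth limit 10
        rf.buildClassifier(data);
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1));  // 10-fold cross-validation
        System.out.println(eval.toSummaryString());
    }
}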
Visualization:
Experiment 7: Implementation of the Naïve Bayesian Classifier
Code:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
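In the Classify tab this corresponds to choosing bayes > NaiveBayes. A minimal Java sketch, assuming weka.jar is on the classpath and the data above is saved as weather.arff (the file name is an assumption):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);     // class = play
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb); // class priors and per-attribute conditional counts
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1)); // 10-fold cross-validation
        System.out.println(eval.toSummaryString());
    }
}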
Output:
Experiment 8: Calculating Information Gain Measures
Code:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
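Information gain for an attribute A is Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) · Entropy(Sv), where v ranges over the values of A. For this 14-instance weather dataset, Entropy(S) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940 bits, and outlook ranks highest with a gain of about 0.247 bits. In Weka this is computed with InfoGainAttributeEval under the Select Attributes tab; a minimal Java sketch, assuming weka.jar is on the classpath and the data above is saved as weather.arff (the file name is an assumption):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InfoGainDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);     // class = play
        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new InfoGainAttributeEval()); // information gain w.r.t. play
        sel.setSearch(new Ranker());                   // rank all attributes by gain
        sel.SelectAttributes(data);
        System.out.println(sel.toResultsString());     // gains and resulting ranking
    }
}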
Output:
Experiment 9: Implementation of the Apriori Algorithm for Market Basket Analysis
Code:
@relation market_basket
@attribute bread {t, f}
@attribute milk {t, f}
@attribute butter {t, f}
@attribute beer {t, f}
@attribute eggs {t, f}
@data
t, t, f, f, t
t, f, t, f, f
f, t, t, t, t
t, t, t, f, f
f, f, t, t, t
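In the Associate tab this corresponds to the Apriori associator. A minimal Java sketch, assuming weka.jar is on the classpath and the data above is saved as market_basket.arff (the file name is an assumption); with only 5 transactions, the minimum support is lowered so that rules can be found:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("market_basket.arff"); // path is an assumption
        Apriori apriori = new Apriori();
        // Up to 10 rules, minimum confidence 0.8, minimum support 0.4 (2 of 5 baskets)
        apriori.setOptions(Utils.splitOptions("-N 10 -C 0.8 -M 0.4"));
        apriori.buildAssociations(data);
        System.out.println(apriori); // the mined association rules
    }
}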
Output:
Experiment 10: Implementation of the K-means Algorithm for Clustering
Code:
@relation simple_clusters
@attribute height numeric
@attribute weight numeric
@data
170, 60
180, 80
160, 55
175, 77
165, 58
155, 50
185, 85
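In the Cluster tab this corresponds to choosing SimpleKMeans. A minimal Java sketch, assuming weka.jar is on the classpath and the data above is saved as simple_clusters.arff (the file name and the choice of k = 2 are assumptions):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("simple_clusters.arff"); // path is an assumption
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2); // e.g. a lighter group and a heavier group
        kmeans.setSeed(10);       // fixed seed for reproducible centroids
        kmeans.buildClusterer(data);
        System.out.println(kmeans); // centroids and cluster sizes
        for (int i = 0; i < data.numInstances(); i++) {
            System.out.println(data.instance(i) + " -> cluster "
                    + kmeans.clusterInstance(data.instance(i)));
        }
    }
}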
Output: