Data Warehousing and Mining (DWM) Practical
LAB MANUAL
(ACADEMIC RECORD)
CLASS: TE
SEMESTER: V
Program Outcomes as defined by NBA (PO)
Engineering Graduates will be able to:
1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an
engineering specialization to the solution of complex engineering problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety, and the cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to
provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with an
understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional
engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of
the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in diverse
teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and
design documentation, make effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the engineering
and management principles and apply these to one’s own work, as a member and leader in a team, to
manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
Institute Vision: To create value-based technocrats to fit in the world of work and research.
Course Outcomes
CSC504.3: Identify appropriate data mining algorithms to solve real world problems.
CSC504.5: Describe complex information and social networks with respect to web mining.
DATTA MEGHE COLLEGE OF ENGINEERING
AIROLI, NAVI MUMBAI - 400708
CERTIFICATE
Date: _________________
Examined on: _________________

Datta Meghe College of Engineering, Airoli, Navi Mumbai
List of Experiments
Course Name: Data Warehousing and Mining
Course Code: CSC504
2. Implementation of all dimension tables and fact table based on experiment 1 case study. (CSC504.1)
3. To implement OLAP operations: Slice, Dice, Roll up, Drill down, and Pivot based on experiment 1 case study. (CSC504.1)
4. Implementation of Bayesian Classification Algorithm. (CSC504.3)
5. Implementation of Data Discretization and Visualization. (CSC504.2)
6. Perform data preprocessing task and demonstrate Classification, Clustering, Association algorithm on data sets using data mining tool (WEKA / R tool). (CSC504.2, CSC504.3, CSC504.4)
7. To implement Clustering Algorithm (K-means). (CSC504.4)
9. Implementation of Association rule Mining (Apriori algorithm). (CSC504.4)
10. Implementation of Page Rank Algorithm. (CSC504.5)
EXPERIMENT NO. 1
Aim: One case study on building a Data Warehouse / Data Mart. Write a detailed problem statement and design the dimensional model (creation of star and snowflake schema).
Theory:
• Layers of a Data Warehouse Architecture
• While there can be various layers in a data warehouse architecture, there are a few standard ones that are responsible for the efficient functioning of the data warehouse software.
Data Mart
• A data mart is oriented to a specific purpose or major data subject that may be distributed to support business needs.
• It is a subset of the Data Warehouse / data resource.
Star Schema:
• It represents the multidimensional model.
• In this model (dimensional modeling), the data is organized into facts and dimensions.
• The star model is the underlying structure for a dimensional model.
• It has one broad central table (fact table) and a set of smaller tables (dimensions) arranged in a star design.
Snowflake Schema
• A snowflake schema is a multi-dimensional data model that is an extension of a star schema, where dimension tables are broken down into sub-dimensions.
Case Study:
Problem Statement:
An anime recommendation platform wants to analyze user viewing preferences to build a more
personalized recommendation system. They have access to user ratings and detailed information about
anime shows. The goal is to identify viewing trends, most-watched genres, popular anime, and user
behavior patterns over time. To achieve this, a data warehouse is required to organize and process large
volumes of anime rating data efficiently.
Analysis to be done
How the above analysis improves the business, i.e. addresses the above problem definition
Design Information Package diagram
fact_rating
– rating_id [PK]
– user_key [FK]
– anime_key [FK]
– date_key [FK]
– type_key [FK]
– user_rating
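The fact table above can be declared directly in SQL. The statement below is a possible MySQL definition consistent with this column list and with the dimension tables of Experiment 2; the data types and FOREIGN KEY clauses are assumptions added for illustration, not part of the original case study.

CREATE TABLE fact_rating (
    rating_id INT PRIMARY KEY,      -- surrogate key of the fact row
    user_key INT,                   -- references dim_users
    anime_key INT,                  -- references dim_anime
    date_key INT,                   -- references dim_date
    type_key INT,                   -- references dim_type
    user_rating DECIMAL(4,2),       -- measure: the rating given by the user
    FOREIGN KEY (user_key) REFERENCES dim_users(user_key),
    FOREIGN KEY (anime_key) REFERENCES dim_anime(anime_key),
    FOREIGN KEY (date_key) REFERENCES dim_date(date_key),
    FOREIGN KEY (type_key) REFERENCES dim_type(type_key)
);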
Conclusion:
R1 (3)    R2 (5)    R3 (4)    R4 (3)    Total (15)    Sign with Date
EXPERIMENT NO. 2
Aim: Implementation of all dimension tables and fact table based on experiment 1 case study.
Theory:
• Implementation of each dimension table and the fact table of the star / snowflake schema using the CREATE TABLE command.
• Screenshots of data populated in every dimension table and fact table (at least 20 entries in each table).
Table: dim_anime
Purpose: Describes the "what" of the data. It contains details about each anime, such as its name,
genre, and type.
My Sql Command:
CREATE TABLE dim_anime (
    anime_key INT PRIMARY KEY,
    anime_id INT,
    name VARCHAR(255),
    genre VARCHAR(255),
    type VARCHAR(50),
    episodes INT,
    average_rating DECIMAL(4,2),
    members INT
);
Insert command:
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (1, 32281, 'Kimi no Na wa.', 'Drama, Romance, School, Supernatural', 'Movie', 1, 9.37, 200630);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (2, 5114, 'Fullmetal Alchemist: Brotherhood', 'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen', 'TV', 64, 9.26, 793665);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (3, 28977, 'Gintama°', 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen', 'TV', 51, 9.25, 114262);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (4, 9253, 'Steins;Gate', 'Sci-Fi, Thriller', 'TV', 24, 9.17, 673572);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (5, 9969, 'Gintama''', 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen', 'TV', 51, 9.16, 151266);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (6, 32935, 'Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou', 'Comedy, Drama, School, Shounen, Sports', 'TV', 10, 9.15, 93351);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (7, 11061, 'Hunter x Hunter (2011)', 'Action, Adventure, Shounen, Super Power', 'TV', 148, 9.13, 425875);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (8, 820, 'Ginga Eiyuu Densetsu', 'Drama, Military, Sci-Fi, Space', 'OVA', 110, 9.11, 80679);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (9, 15335, 'Gintama Movie: Kanketsu-hen - Yorozuya yo Eien Nare', 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen', 'Movie', 1, 9.10, 72534);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (10, 15417, 'Gintama'': Enchousen', 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen', 'TV', 13, 9.11, 81109);
[Populated dim_anime table (columns: anime_key, anime_id, name, genre, type, episodes, average_rating, members) showing the ten rows inserted above.]
Table: dim_users
Purpose: Describes the "who". It holds information about each user who provided a rating.
My Sql Command:
Insert command:

User Key    User ID
1           101
2           102
3           103
4           104
5           105
6           106
7           107
8           108
9           109
10          110

Table: dim_type
Purpose: Describes the content type of each title (TV, Movie, OVA, Special, ONA, Music).
My Sql Command:
Insert command:
INSERT INTO dim_type (type_key, type_name) VALUES (1, 'TV');
INSERT INTO dim_type (type_key, type_name) VALUES (2, 'Movie');
INSERT INTO dim_type (type_key, type_name) VALUES (3, 'OVA');
INSERT INTO dim_type (type_key, type_name) VALUES (4, 'Special');
INSERT INTO dim_type (type_key, type_name) VALUES (5, 'ONA');
INSERT INTO dim_type (type_key, type_name) VALUES (6, 'Music');

Type Key    Type Name
1           TV
2           Movie
3           OVA
4           Special
5           ONA
6           Music
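The CREATE TABLE statements for dim_users and dim_type are not shown above; a minimal sketch consistent with the columns listed (column data types are assumptions) would be:

CREATE TABLE dim_users (
    user_key INT PRIMARY KEY,       -- surrogate key referenced by fact_rating
    user_id INT                     -- natural user id from the ratings data
);
CREATE TABLE dim_type (
    type_key INT PRIMARY KEY,
    type_name VARCHAR(50)           -- TV, Movie, OVA, Special, ONA, Music
);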
Table: dim_date
Purpose: Describes the "when". It contains attributes related to the time of the rating, such as the year,
quarter, and day of the week.
My Sql Command:
Insert command:
INSERT INTO dim_date (date_key, full_date, year, quarter, month, day, day_of_week) VALUES (20230105, '2023-01-05', 2023, 1, 1, 5, 'Thursday');
INSERT INTO dim_date (date_key, full_date, year, quarter, month, day, day_of_week) VALUES (20230115, '2023-01-15', 2023, 1, 1, 15, 'Sunday');
INSERT INTO dim_date (date_key, full_date, year, quarter, month, day, day_of_week) VALUES (20230210, '2023-02-10', 2023, 1, 2, 10, 'Friday');
INSERT INTO dim_date (date_key, full_date, year, quarter, month, day, day_of_week) VALUES (20230320, '2023-03-20', 2023, 1, 3, 20, 'Monday');
INSERT INTO dim_date (date_key, full_date, year, quarter, month, day, day_of_week) VALUES (20230401, '2023-04-01', 2023, 2, 4, 1, 'Saturday');
[Populated dim_date table (columns: date_key, full_date, year, quarter, month, day, day_of_week) showing the rows inserted above.]
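The CREATE TABLE statement for dim_date is not shown above; a minimal sketch consistent with these INSERT statements (data types are assumptions) would be:

CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,       -- yyyymmdd style surrogate key, e.g. 20230105
    full_date DATE,
    year INT,
    quarter INT,
    month INT,
    day INT,
    day_of_week VARCHAR(10)
);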
Conclusion:
R1 (3)    R2 (5)    R3 (4)    R4 (3)    Total (15)    Sign with Date
EXPERIMENT NO. 3
Aim: Implementation of OLAP operations: Slice, Dice, Roll up, Drill down and Pivot based on Experiment 1 case study.
Theory:
1. Rollup (drill-up):
ROLLUP is used in tasks involving subtotals. It creates subtotals at any level of aggregation
needed, from the most detailed up to a grand total i.e. climbing up a concept hierarchy for the
dimension such as time or geography.
Example: A query could involve a ROLLUP of year > month > day or country > state > city.
QUESTION:
From the fact_rating table (which stores user ratings) and the dim_date table (which contains year, quarter, and month information), write an SQL query to display the year, month, and day of each rating, calculate the average user rating for each day, and arrange the results in chronological order by year, month, and day.
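A possible answer sketch in MySQL (table and column names follow the Experiment 2 schema; WITH ROLLUP is MySQL's form of the ROLLUP operator and adds subtotal rows per month, per year and a grand total):

SELECT d.year, d.month, d.day, AVG(f.user_rating) AS avg_rating
FROM fact_rating f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.month, d.day WITH ROLLUP;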
2. Slicing:
A slice in a multidimensional array is a column of data corresponding to a single value for one or more members of a dimension. It helps the user to visualize and gather the information specific to a dimension.
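For example, a slice that fixes a single member of the type dimension can be written against the Experiment 2 schema as follows (an illustrative sketch, not the prescribed answer):

SELECT a.name, AVG(f.user_rating) AS avg_rating
FROM fact_rating f
JOIN dim_anime a ON f.anime_key = a.anime_key
JOIN dim_type t ON f.type_key = t.type_key
WHERE t.type_name = 'Movie'        -- slice: fix a single value of the type dimension
GROUP BY a.name;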
QUESTION:
3. Dicing:
Dicing is similar to slicing, but it works a little differently. When one thinks of slicing, filtering is done to focus on a particular attribute. Dicing, on the other hand, is more of a zoom feature that selects a subset over all the dimensions, but for specific values of the dimension.
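A dice restricts several dimensions at once; for instance (again an illustrative sketch on the Experiment 2 schema):

SELECT t.type_name, d.quarter, AVG(f.user_rating) AS avg_rating
FROM fact_rating f
JOIN dim_type t ON f.type_key = t.type_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE t.type_name IN ('TV', 'Movie')            -- restrict the type dimension to two values
  AND d.year = 2023 AND d.quarter IN (1, 2)     -- and the date dimension to two quarters
GROUP BY t.type_name, d.quarter;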
QUESTION:
QUESTION:
2. Show the average user rating for each content type, broken down month-wise (Jan to Sep).
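One way to express this month-wise breakdown (a pivot) in plain MySQL is conditional aggregation. The sketch below assumes the Experiment 2 schema; only months 1 to 3 and 9 are written out, the remaining months follow the same pattern:

SELECT t.type_name,
       AVG(CASE WHEN d.month = 1 THEN f.user_rating END) AS jan_avg,
       AVG(CASE WHEN d.month = 2 THEN f.user_rating END) AS feb_avg,
       AVG(CASE WHEN d.month = 3 THEN f.user_rating END) AS mar_avg,
       AVG(CASE WHEN d.month = 9 THEN f.user_rating END) AS sep_avg
FROM fact_rating f
JOIN dim_type t ON f.type_key = t.type_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.month BETWEEN 1 AND 9
GROUP BY t.type_name;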
R1 (3)    R2 (5)    R3 (4)    R4 (3)    Total (15)    Sign with Date
EXPERIMENT NO. 4
Aim: Implementation of Bayesian Classification Algorithm.
Theory:
Naive Bayes model is easy to build and particularly useful for very large data sets.
Along with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.
Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c),
P(x) and P(x|c).
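In symbols:
P(c|x) = [P(x|c) × P(c)] / P(x)
and, under the naive independence assumption over attributes x1, ..., xn,
P(c | x1, ..., xn) ∝ P(c) × P(x1|c) × ... × P(xn|c).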
Consider a training data set of weather conditions and the corresponding target variable 'Play' (indicating whether a game was played). We need to classify whether players will play or not based on the weather condition.
Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by finding the probabilities, for example P(Overcast) = 0.29 and P(Play = Yes) = 0.64.
• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, i.e. the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
• Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
• Problem: Players will play if the weather is sunny. Is this statement correct?
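A worked solution, assuming the standard 14-day weather data set from which the probabilities above (0.29 and 0.64) are taken:
P(Yes | Sunny) = P(Sunny | Yes) × P(Yes) / P(Sunny) = (3/9 × 9/14) / (5/14) = (0.33 × 0.64) / 0.36 ≈ 0.60
Since P(Yes | Sunny) ≈ 0.60 is higher than P(No | Sunny) ≈ 0.40, the prediction is "play", so the statement is correct.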
Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.
Advantages:
• It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
• When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models such as logistic regression, and it needs less training data.
• It performs well in the case of categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).
Disadvantages:
• If a categorical variable has a category in the test data set which was not observed in the training data set, then the model will assign a zero probability and will be unable to make a prediction. This is often known as "Zero Frequency". To solve this, we can use a smoothing technique.
One of the simplest smoothing techniques is called Laplace estimation.
• Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is
almost impossible that we get a set of predictors which are completely independent.
PROGRAM:
import java.util.*;

public class NaiveBayesAnime {
    // Each record: content type, episode count (kept as a category) and rating class (High / Low)
    record Anime(String type, String episodes, String rating) {}

    public static void main(String[] args) {
        List<Anime> dataset = new ArrayList<>();
        dataset.add(new Anime("Movie", "1", "High"));
        dataset.add(new Anime("TV", "64", "High"));
        dataset.add(new Anime("TV", "51", "High"));
        dataset.add(new Anime("TV", "24", "Low"));
        dataset.add(new Anime("TV", "51", "Low"));
        dataset.add(new Anime("TV", "10", "Low"));
        dataset.add(new Anime("TV", "148", "Low"));
        dataset.add(new Anime("OVA", "110", "Low"));
        dataset.add(new Anime("Movie", "1", "Low"));
        dataset.add(new Anime("TV", "13", "Low"));
        // Classify a new instance (type = "TV", episodes = "51") using P(c) * P(type|c) * P(episodes|c)
        String best = null;
        double bestP = -1;
        for (String c : new String[]{"High", "Low"}) {
            double nc = dataset.stream().filter(a -> a.rating().equals(c)).count();
            double p = (nc / dataset.size())
                    * dataset.stream().filter(a -> a.rating().equals(c) && a.type().equals("TV")).count() / nc
                    * dataset.stream().filter(a -> a.rating().equals(c) && a.episodes().equals("51")).count() / nc;
            System.out.printf("P(%s | TV, 51 episodes) is proportional to %.4f%n", c, p);
            if (p > bestP) { bestP = p; best = c; }
        }
        System.out.println("Predicted rating class: " + best);
    }
}
INPUT & OUTPUT:
Conclusion:
R1 (3)    R2 (5)    R3 (4)    R4 (3)    Total (15)    Sign with Date
EXPERIMENT NO. 5
Aim: Implementation of Data Discretization and Visualization.
Theory: Data discretization refers to a method of converting a huge number of data values
into smaller ones so that the evaluation and management of data become easy. In other
words, data discretization is a method of converting attributes values of continuous data into
a finite set of intervals with minimum data loss. There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization. Supervised discretization refers to a method in which the class data is used. Unsupervised discretization refers to a method that depends upon the way in which the operation proceeds; it works on a top-down splitting strategy or a bottom-up merging strategy.
Examples of discretization techniques:
• Binning: Binning refers to a data smoothing technique that helps to group a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and for the development of concept hierarchies.
• Histogram analysis
• Cluster analysis
• Decision tree analysis: Here discretization uses a top-down splitting technique and is done through a supervised procedure. To discretize a numeric attribute, first select the split point that has the least entropy, and then run it through a recursive process. The recursive process divides the attribute into various discretized disjoint intervals, from top to bottom, using the same splitting criterion.
• Correlation analysis: Discretizing data by the linear regression technique, you can get the best neighbouring intervals, and then the large intervals are combined to develop larger overlaps to form the final overlapping intervals. It is a supervised procedure.
Data visualization
Data visualization is actually a set of data points and information that are represented graphically to make it easy and quick for users to understand. Data visualization is good if it has a clear meaning and purpose, and is very easy to interpret without requiring context. Tools of data visualization provide an accessible way to see and understand trends, outliers, and patterns in data by using visual effects or elements such as charts, graphs, and maps.
• Histogram
• Box plot
A box plot is a graph that gives you a good indication of how the values in the data are spread out. Although box plots may seem primitive in comparison to a histogram or density plot, they have the advantage of taking up less space, which is useful when comparing distributions between many groups or datasets.
[Figure: box plot by content type (Movie, TV, OVA); y-axis from 0 to 900,000.]
• Scatter plots
Scatter plots are useful to display the relative density of two dimensions of data.
Well-designed ones quantify and correlate complex sets of data in an easy-to-read
manner. Often, these charts are used to discover trends and data, as much as they
are to visualize the data.
[Figure: scatter plot of average rating (about 9.05 to 9.35) versus number of episodes (0 to 160).]
• Matrix plots
These are the special types of plots that use two-dimensional matrix data for
visualization. It is difficult to analyze and generate patterns from matrix data because
of its large dimensions. So, this makes the process easier by providing color coding
to matrix data.
• Star plots
The star plot (Chambers 1983) is a method of displaying multivariate data. Each star represents a single observation. Typically, star plots are generated in a multi-plot format with many stars on each page, each star representing one observation. Star plots are used to examine the relative values for a single data point.
• InstantAtlas
The InstantAtlas team prepares and manages large statistical indicator data sets and delivers community information systems, local observatories and knowledge-hub websites for clients as fully managed services, so that you can build your own services and sites using ArcGIS Online and WordPress.
• Timeline
A timeline is a great data visualization technique when you wish to show data in chronological order and highlight important points in time. To create a timeline, simply lay out your data points along a PowerPoint shape, and mark the data off to visually see your overall project.
PROGRAM:
import math
from collections import OrderedDict

# Equal-frequency binning with smoothing by bin means
print("\nEnter the data:")
x = list(map(float, input().split()))
print("\nEnter the number of bins:")
bi = int(input())

# keep the original values in x_old and work on a value-sorted copy for binning
X_dict = OrderedDict()
x_old = {}
x_new = {}
for i in range(len(x)):
    X_dict[i] = x[i]
    x_old[i] = x[i]
X_dict = OrderedDict(sorted(X_dict.items(), key=lambda item: item[1]))

num_of_data_in_each_bin = int(math.ceil(len(x) / bi))
binn = []      # mean value of each bin
avrg = 0
i = 0
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        avrg += h
        i += 1
    else:
        binn.append(round(avrg / num_of_data_in_each_bin, 3))
        avrg = h
        i = 1
# close the last bin, which may hold fewer than num_of_data_in_each_bin values
binn.append(round(avrg / i, 3))

# replace every value with the mean of the bin it falls into
i = 0
j = 0
for g, h in X_dict.items():
    if i < num_of_data_in_each_bin:
        x_new[g] = binn[j]
        i += 1
    else:
        j += 1
        x_new[g] = binn[j]
        i = 1

print("\nOriginal values:", list(x_old.values()))
print("Bin means:", binn)
print("Smoothed values (bin means, in the original order):",
      [x_new[key] for key in sorted(x_new.keys())])
Output:
Conclusion:
R1 (3)    R2 (5)    R3 (4)    R4 (3)    Total (15)    Sign with Date
EXPERIMENT NO. 6
Aim: Perform data preprocessing task and demonstrate Classification, Clustering, Association algorithm on data sets using data mining tool (WEKA / R tool).
Theory:
Waikato Environment for Knowledge Analysis (Weka) is a popular suite of machine learning
software written in Java, developed at the University of Waikato, New Zealand. It is free
software licensed under the GNU General Public License.
Weka is a workbench that contains a collection of visualization tools and algorithms for
data analysis and predictive modeling, together with graphical user interfaces for easy access
to these functions.
The original version of Weka was primarily designed as a tool for analyzing data from agricultural
domains, but the more recent fully Java-based version (Weka 3), for which development
started in 1997, is now used in many different application areas, in particular for educational
purposes and research.
• Free availability under the GNU General Public License.
• Portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform.
• A comprehensive collection of data preprocessing and modeling techniques.
Weka supports several standard data mining tasks, more specifically, data preprocessing,
clustering, classification, regression, visualization, and feature selection. All of Weka's
techniques are predicated on the assumption that the data is available as one flat file or
relation, where each data point is described by a fixed number of attributes (normally,
numeric or nominal attributes, but some other attribute types are also supported). Weka
provides access to SQL databases using Java Database Connectivity and can process the
result returned by a database query. It is not capable of multi-relational data mining, but
there is separate software for converting a collection of linked database tables into a single
table that is suitable for processing using Weka.
Weka's main user interface is the Explorer, but essentially the same functionality can be
accessed through the component-based Knowledge Flow interface and from the command
line. There is also the Experimenter, which allows the systematic comparison of the predictive
performance of Weka's machine learning algorithms on a collection of datasets.
The Explorer interface features several panels providing access to the main components of
the workbench:
The Preprocess panel has facilities for importing data from a database, a comma-
separated values (CSV) file, etc., and for preprocessing this data using a so-called filtering
algorithm. These filters can be used to transform the data (e.g., turning numeric
attributes into discrete ones) and make it possible to delete instances and attributes
according to specific criteria.
The Classify panel enables applying classification and regression algorithms to the resulting dataset, to estimate the accuracy of the resulting predictive model, and to visualize erroneous predictions, receiver operating characteristic (ROC) curves, etc., or the model itself.
The Associate panel provides access to association rule learners that attempt to
identify all important interrelationships between attributes in the data.
The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple
k-means algorithm. There is also an implementation of the expectation maximization
algorithm for learning a mixture of normal distributions.
The Select attributes panel provides algorithms for identifying the most predictive
attributes in a dataset.
The Visualize panel shows a scatter plot matrix, where individual scatter plots can be
selected and enlarged, and analyzed further using various selection operators.
Preprocessing in WEKA
In the "Filter" panel, click on the "Choose" button. This will show a popup window
with a list of available filters. Scroll down the list and select the
"weka.filters.unsupervised.attribute.Remove" filter as shown in Figure.
Classification using WEKA:
This experiment illustrates the use of the Naive Bayes classifier in Weka. Consider the sample data set "employee", available in ARFF format. This document assumes that appropriate data preprocessing has been performed.
Step 1: Open the employee data file in the Weka Explorer's Preprocess panel.
Step 2: Next we select the "Classify" tab and click the "Choose" button to select the "NaiveBayes" classifier.
Step 3: Now specify the various parameters. These can be specified by clicking in the text box to the right of the "Choose" button. In this example, we accept the default values.
Step 4: Under the "Test options" in the main panel, select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.
Step 5: Now click "Start" to generate the model. The textual form of the model as well as the evaluation statistics will appear in the right panel when model construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed (either in preprocessing or in selecting different parameters for the classification).
Step 7: Weka also lets us view a graphical version of the results. This can be done by right-clicking the last result set in the result list and selecting a visualization option (for example, "Visualize classifier errors") from the pop-up menu.
Step 9: In the main panel under "Test options", click the "Supplied test set" radio button and then click the "Set" button. This will show a pop-up window which will allow you to open the file containing the test instances.
@relation employee
@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary {10k, 15k, 17k, 20k, 25k, 30k, 35k, 32k}
@attribute performance {good, avg, poor}
@data
The following screenshot shows the classifier output that was generated when the Naive Bayes algorithm was applied to the given dataset.
Clustering Using WEKA:
This experiment illustrates the use of simple k-means clustering with the Weka Explorer.
The sample data set used for this example is based on the iris data available in ARFF format.
This document assumes that appropriate preprocessing has been performed. This iris dataset
includes 150 instances.
Step 1: Run the Weka explorer and load the data file iris.arff in preprocessing interface.
Step 2: In order to perform clustering select the ‘cluster’ tab in the explorer and click on the
choose button. This step results in a dropdown list of available clustering algorithms.
Step 4: Next, click on the text box to the right of the "Choose" button to get the pop-up window shown in the screenshots. In this window we enter six as the number of clusters and we leave the value of the seed as it is. The seed value is used in generating a random number which is used for making the internal assignments of instances to clusters.
Step 5: Once the options have been specified, we run the clustering algorithm. Here we must make sure that, in the "Cluster mode" panel, the "Use training set" option is selected, and then we click the "Start" button. This process and the resulting window are shown in the following screenshots.
Step 6: The result window shows the centroid of each cluster as well as statistics on the number and the percentage of instances assigned to the different clusters. Here the cluster centroids are mean vectors for each cluster; these can be used to characterise the clusters. For example, the centroid of cluster 1 shows that for the class Iris-versicolor the mean value of the sepal length is 5.4706, sepal width 2.4765, petal width 1.1294 and petal length 3.7941.
The following screenshot shows the clustering results that were generated when the simple k-means algorithm was applied to the given dataset.
Interpretation of the above visualization
From the above visualization, we can understand the distribution of sepal length and petal length in each cluster. For instance, we can see how each cluster is dominated by petal length. By changing the colour dimension to other attributes we can see their distribution within each of the clusters.
Step 8: We can save the resulting dataset, which includes each instance along with its assigned cluster. To do so, we click the Save button in the visualization window and save the result as iris k-mean. The top portion of this file is shown in the following figure.
Association Rule Mining in WEKA:
This experiment illustrates some of the basic elements of association rule mining using WEKA. The sample dataset used for this example is test.arff.
Step 1: Open the data file in the Weka Explorer. It is presumed that the required data fields have been discretized; in this example it is the age attribute.
Step2: Clicking on the associate tab will bring up the interface for
association rule algorithm.
Step 4: In order to change the parameters for the run (e.g., support, confidence, etc.) we click on the text box immediately to the right of the "Choose" button.
Dataset test.arff
@relation test
@attribute admissionyear {2005, 2006, 2007, 2008, 2009, 2010}
% the second attribute declaration is not reproduced here; a possible form (the name 'branch' is assumed) is:
@attribute branch {cse, it, ece, mech}
@data
2005, cse
2005, it
2005, cse
2006, mech
2006, it
2006, ece
2007, it
2007, cse
2008, it
2008, cse
2009, it
2009, ece
The following screenshot shows the association rules that were generated when the Apriori algorithm was applied to the given dataset.
1. Preprocessing data
2. Classification
3. Clustering
4. Association
Conclusion:
R1 (3)    R2 (5)    R3 (4)    R4 (3)    Total (15)    Sign with Date
EXPERIMENT NO. 7
Aim: To implement Clustering Algorithm (K-means).
Theory:
Clustering is the process of grouping the data into classes or clusters, so that objects within a
cluster have high similarity in comparison to one another but are very dissimilar to objects in other
clusters. Dissimilarities are assessed based on the attribute values describing the objects. Often,
distance measures are used. Clustering has its roots in many areas, including data mining, statistics,
biology, and machine learning.
Partitioning Methods
Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that objects within a cluster are similar to one another and dissimilar to objects in other clusters.
The k-means algorithm takes the input parameter, k, and partitions a set of n objects
into k clusters so that the resulting intracluster similarity is high but the intercluster
similarity is low. Cluster similarity is measured in regard to the mean value of the
objects in a cluster, which can be viewed as the cluster’s centroid or center of gravity.
Advantages
• Easy to implement
• An instance can change cluster (move to another cluster) when the centroids are re-computed.
Disadvantages
• The number of clusters k has to be specified in advance.
• The result is sensitive to the initial choice of centroids and to outliers.
Applications:
The K-means clustering algorithm is used to find groups which have not been explicitly
labeled in the data. This can be used to confirm business assumptions about what types of
groups exist or to identify unknown groups in complex data sets. Once the algorithm has been
run and the groups are defined, any new data can be easily assigned to the correct group.
This is a versatile algorithm that can be used for any type of grouping. Some
examples of use cases are:
Sorting sensor measurements:
o Group images
o Separate audio
o Identify groups in health monitoring
Detecting bots or anomalies:
o Separate valid activity groups from bots
PROGRAM:
import java.util.Scanner;

public class KMeans {
    static final int MAX_POINTS = 100, MAX_CLUSTERS = 10;

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        int n, k, i, j, iter;
        int[] cluster = new int[MAX_POINTS];
        double[] x = new double[MAX_POINTS];
        double[] y = new double[MAX_POINTS];
        double[] centroidX = new double[MAX_CLUSTERS];
        double[] centroidY = new double[MAX_CLUSTERS];
        double[] newCentroidX = new double[MAX_CLUSTERS];
        double[] newCentroidY = new double[MAX_CLUSTERS];
        int[] count = new int[MAX_CLUSTERS];
        boolean changed;

        System.out.print("Enter the number of points (n) and clusters (k): ");
        n = sc.nextInt();
        k = sc.nextInt();

        // Input coordinates
        System.out.println("Enter the coordinates (x y) for each point:");
        for (i = 0; i < n; i++) {
            x[i] = sc.nextDouble();
            y[i] = sc.nextDouble();
            cluster[i] = -1; // initialize cluster assignment
        }

        // Initialize the centroids with the first k points
        for (j = 0; j < k; j++) { centroidX[j] = x[j]; centroidY[j] = y[j]; }

        // Repeat: assign each point to its nearest centroid, then recompute the centroids
        for (iter = 0; iter < 100; iter++) {
            changed = false;
            for (i = 0; i < n; i++) {
                int nearest = 0;
                double bestDist = Double.MAX_VALUE;
                for (j = 0; j < k; j++) {
                    double d = (x[i] - centroidX[j]) * (x[i] - centroidX[j])
                             + (y[i] - centroidY[j]) * (y[i] - centroidY[j]);
                    if (d < bestDist) { bestDist = d; nearest = j; }
                }
                if (cluster[i] != nearest) { cluster[i] = nearest; changed = true; }
            }
            for (j = 0; j < k; j++) { newCentroidX[j] = 0; newCentroidY[j] = 0; count[j] = 0; }
            for (i = 0; i < n; i++) {
                newCentroidX[cluster[i]] += x[i];
                newCentroidY[cluster[i]] += y[i];
                count[cluster[i]]++;
            }
            for (j = 0; j < k; j++) {
                if (count[j] > 0) {
                    centroidX[j] = newCentroidX[j] / count[j];
                    centroidY[j] = newCentroidY[j] / count[j];
                }
            }
            if (!changed) break; // converged: no assignment changed in this iteration
        }

        // Final Output
        System.out.println("\nFinal Clusters:");
        for (j = 0; j < k; j++) {
            System.out.print("Cluster " + (j + 1) + ": ");
            for (i = 0; i < n; i++) {
                if (cluster[i] == j) {
                    System.out.printf("(%.2f, %.2f) ", x[i], y[i]);
                }
            }
            System.out.println();
        }
        sc.close();
    }
}
INPUT & OUTPUT:
Conclusion:
R1 (3)    R2 (5)    R3 (4)    R4 (3)    Total (15)    Sign with Date
EXPERIMENT NO. 8
Aim: Implementation of Hierarchical Clustering Algorithm.
Theory:
A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly identifies the two clusters that are closest together and merges them.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called a dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are split (top-down view).
This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic
clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain
termination conditions are satisfied.
Given a set of N objects to be clustered, the basic agglomerative process is:
Step 1: Assign each object to a cluster so that for N objects we have N clusters, each containing just one object.
Step 2: Let the distances between the clusters be the same as the distances between the objects they contain.
Step 3: Find the most similar pair of clusters and merge them into a single cluster, so that we now have one cluster less.
Step 4: Compute the distances between the new cluster and each of the old clusters.
Step 5: Repeat steps 3 and 4 until all items are clustered into a single cluster of size N.
• Step 4 can be done in different ways, and this is what distinguishes single-linkage from complete-linkage clustering.
o In complete-linkage clustering, the process is terminated when the maximum distance between nearest clusters exceeds an arbitrary threshold.
o In single-linkage clustering, the process is terminated when the minimum distance between nearest clusters exceeds an arbitrary threshold.
o EXAMPLE:
Suppose this data is to be clustered.
• In this example, cutting the tree after the second row of the dendrogram will yield clusters {a} {b
c} {d e} {f}.
• Cutting the tree after the third row will yield clusters {a} {b c} {d e f}, which is a coarser
clustering, with a smaller number but larger clusters.
In our example, we have six elements {a} {b} {c} {d} {e} and {f}.
Usually, we take the two closest elements, according to the chosen distance.
Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances
updated. Suppose we have merged the two closest elements b and c, we now have the following clusters
{a}, {b, c}, {d}, {e} and {f}, and want to merge them further.
To do that, we need to take the distance between {a} and {b c}, and therefore define the distance
between two clusters. Usually the distance between two clusters A and B is one of the following:
• The maximum distance between elements of each cluster (also called complete-linkage clustering):
max {d(x,y):x∈A,y∈B}
• The minimum distance between elements of each cluster (also called single-linkage clustering):
min {d(x,y):x∈A,y∈B}
• The mean distance between elements of each cluster (also called average linkage clustering):
(1/(|A|·|B|)) Σ {d(x,y) : x∈A, y∈B}
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and
one can decide to stop clustering either when the clusters are too far apart to be merged (distance
criterion) or when there is a sufficiently small number of clusters (number criterion).
PROGRAM:
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram

# Read the points and the desired number of clusters
n = int(input("Enter the number of points: "))
print("Enter the coordinates (x y) for each point:")
points = np.array([list(map(float, input().split())) for _ in range(n)])
k = int(input("Enter the number of clusters (k): "))

# Manual single-linkage agglomeration: clusters maps a cluster id to its member point indices
clusters = {i: [i] for i in range(n)}
merge_history = []   # rows [id1, id2, distance, size] in SciPy linkage-matrix format
next_id = n
while len(clusters) > 1:
    ids = list(clusters.keys())
    # pick the pair of clusters whose closest members are nearest (single linkage)
    d, ca, cb = min(
        (min(np.linalg.norm(points[p] - points[q]) for p in clusters[a] for q in clusters[b]), a, b)
        for ia, a in enumerate(ids) for b in ids[ia + 1:])
    merged = clusters.pop(ca) + clusters.pop(cb)
    merge_history.append([ca, cb, d, len(merged)])
    clusters[next_id] = merged
    next_id += 1

# Replay the first n-k merges to obtain k clusters, then label the points 1..k
clusters = {i: [i] for i in range(n)}
next_id = n
for ca, cb, d, size in merge_history[:n - k]:
    clusters[next_id] = clusters.pop(ca) + clusters.pop(cb)
    next_id += 1
labels = [0] * n
for lab, members in enumerate(clusters.values(), start=1):
    for p in members:
        labels[p] = lab
print("\nCluster assignments:")
for idx, lab in enumerate(labels, start=1):
    print(f"Point {idx} ({points[idx-1][0]}, {points[idx-1][1]}) -> Cluster {lab}")
Z = np.array(merge_history)
plt.figure(figsize=(10,6))
dendrogram(Z, labels=[f"P{i+1}" for i in range(n)], color_threshold=0)
plt.title("Dendrogram (single-linkage, manual calculation)")
plt.ylabel("Distance (Euclidean)")
plt.show()
plt.figure(figsize=(6,6))
colors = plt.cm.get_cmap('tab10')
for i in range(n):
    plt.scatter(points[i,0], points[i,1], color=colors((labels[i]-1) % 10), s=80, edgecolor='k')
    plt.text(points[i,0]+0.02, points[i,1]+0.02, f"P{i+1}", fontsize=9)
plt.title(f"Points colored by cluster (k={k})")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.show()
INPUT & OUTPUT:
Conclusion:
R1 (3)    R2 (5)    R3 (4)    R4 (3)    Total (15)    Sign with Date
EXPERIMENT NO. 9
Aim: Implementation of Association Rule Mining (Apriori algorithm).
Theory:
Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear in a data set frequently.
For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a
frequent itemset. Finding such frequent patterns plays an essential role in mining associations, correlations, and many
other interesting relationships among data.
Moreover, it helps in data classification, clustering, and other data mining tasks
as well. Thus, frequent pattern mining has become an important data mining task
and a focused theme in data mining research.
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets
for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge
of frequent itemset properties, as we shall see following. Apriori employs an iterative approach known as a level-wise
search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning
the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The
resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so
on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space: all nonempty subsets of a frequent itemset must also be frequent. We will first describe this property, and then show an example illustrating its use.
Disadvantages
• Sometimes, it may need to find a large number of candidate rules which can be
computationally expensive.
• Calculating support is also expensive because it has to go through the entire database.
Consider the following example. Before beginning the process, let us set the support threshold to
50%, i.e. only those items are significant for which support is more than 50%.
Example:
Step 1: Create a frequency table of all the items that occur in all the transactions. For our case:
Onion(O) 4
Potato(P) 5
Burger(B) 4
Milk(M) 4
Beer(Be) 2
Step 2: We know that only those elements are significant for which the support is greater than
or equal to the threshold support. Here, support threshold is 50%, hence only those items are
significant which occur in more than three transactions and such items are Onion(O),
Potato(P), Burger(B), and Milk(M). Therefore, we are left with:
Onion(O) 4
Potato(P) 5
Burger(B) 4
Milk(M) 4
The table above represents the single items that are purchased by the customers frequently.
Step 3: The next step is to make all the possible pairs of the significant items keeping in mind that
the order doesn’t matter, i.e., AB is same as BA. To do this, take the first item and pair it with all the
others such as OP, OB, OM. Similarly, consider the second item and pair it with preceding items,
i.e., PB, PM. We are only considering the preceding items because PO (same as OP) already
exists. So, all the pairs in our example are OP, OB, OM, PB, PM, BM.
Step 4: We will now count the occurrences of each pair in all the transactions.
OP 4
OB 3
OM 2
PB 4
PM 3
BM 2
Step 5: Again only those itemsets are significant which cross the support threshold, and
those are OP, OB, PB, and PM.
Step 6: Now let’s say we would like to look for a set of three items that are purchased together.
We will use the itemsets found in step 5 and create a set of 3 items.
To create a set of 3 items another rule, called self-join is required. It says that from the item pairs OP,
OB, PB and PM we look for two pairs with the identical first letter and so we get
OPB 4
PBM 3
Applying the threshold rule again, we find that OPB is the only significant itemset.
Therefore, the set of 3 items that was purchased most frequently is OPB.
The example that we considered was a fairly simple one and mining the frequent itemsets stopped at 3 items but in
practice, there are dozens of items and this process could continue to many items. Suppose we got the significant
sets with 3 items as OPQ, OPR, OQR, OQS and PQR and now we want to generate the set of 4 items. For this, we
will look at the sets which have the first two letters in common, i.e., OPQ and OPR give OPQR, and OQR and OQS give OQRS.
In general, we have to look for sets which differ only in their last letter/item.
Applications:
Customer analysis
PROGRAM:
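A minimal Apriori sketch in Python, for illustration only. The transaction list is hypothetical (chosen so that the single-item counts match the worked example above: O = 4, P = 5, B = 4, M = 4, Be = 2), and the level-wise search is implemented in a deliberately simple way; it is not the original lab program.

# Hypothetical transactions over Onion (O), Potato (P), Burger (B), Milk (M), Beer (Be)
transactions = [
    {'O', 'P', 'B', 'M'},
    {'O', 'P', 'B'},
    {'P', 'M', 'Be'},
    {'O', 'P', 'B', 'M'},
    {'P', 'B', 'Be'},
    {'O', 'M'},
]
min_support = 0.5                              # an itemset must appear in at least 50% of the transactions
min_count = min_support * len(transactions)

def support_count(itemset):
    # number of transactions that contain every item of the candidate itemset
    return sum(1 for t in transactions if itemset <= t)

# Level-wise search: L1 from the single items, then self-join Lk-1 to build the candidates Ck
items = sorted(set().union(*transactions))
frequent = [frozenset([i]) for i in items if support_count(frozenset([i])) >= min_count]
k = 2
while frequent:
    print(f"Frequent {k - 1}-itemsets:",
          [(sorted(s), support_count(s)) for s in frequent])
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in sorted(candidates, key=sorted) if support_count(c) >= min_count]
    k += 1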
INPUT & OUTPUT:
Conclusion:
R1 (3)    R2 (5)    R3 (4)    R4 (3)    Total (15)    Sign with Date
EXPERIMENT NO. 10
Aim: Implementation of Page Rank Algorithm.
Theory:
PageRank works by counting the number and quality of links to a page to determine a
rough estimate of how important the website is. The underlying assumption is that more
important websites are likely to receive more links from other websites.
It is not the only algorithm used by Google to order search engine results, but it is the first
algorithm that was used by the company, and it is the best-known. This centrality measure is not implemented for multi-graphs.
Algorithm:
The PageRank algorithm outputs a probability distribution used to represent the likelihood
that a person randomly clicking on links will arrive at any particular page. PageRank can be
calculated for collections of documents of any size. The PageRank computations require several
passes, called “iterations”, through the collection to adjust approximate PageRank values to
more closely reflect the theoretical true value.
Working:
Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or multiple
outbound links from one single page to another single page, are ignored. PageRank is initialized to the same
value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number
of pages on the web at that time, so each page in this example would have an initial value of 1. However, later
versions of PageRank, and the remainder of this section, assume a probability distribution between 0 and 1.
Hence the initial value for each page in this example is 0.25.
The PageRank transferred from a given page to the targets of its outbound links upon the
next iteration is divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would transfer
0.25 PageRank to A upon the next iteration, for a total of 0.75.
PR(A)=PR(B)+PR(C)+PR(D)
Suppose instead that page B had a link to pages C and A, page C had a link to page A, and
page D had links to all three pages. Thus, upon the first iteration, page B would transfer half of its
existing value, or 0.125, to page A and the other half, or 0.125, to page C. Page C would transfer all
of its existing value, 0.25, to the only page it links to, A. Since D had three outbound links, it would
transfer one third of its existing value, or approximately 0.083, to A. At the completion of this iteration,
page A will have a PageRank of approximately 0.458.
PR(A)=PR(B)/2+PR(C)/1+PR(D)/3
In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the number L(·) of its outbound links:
PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D)
In the general case, the PageRank value for any page u can be expressed as:
PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)
i.e., the PageRank value for a page u is dependent on the PageRank values for each
page v contained in the set Bu (the set containing all pages linking to page u), divided by the
number L(v) of links from page v. The algorithm involves a damping factor d for the calculation of the page rank; with damping, the formula becomes PR(u) = (1 - d)/N + d · Σ_{v ∈ B_u} PR(v)/L(v), which is the form computed by the program below.
PROGRAM:
nodes = input("Enter node names separated by space (e.g., A B C D): ").split()
n = len(nodes)
node_index = {nodes[i]: i for i in range(n)}
adj_matrix = [[0]*n for _ in range(n)]
print("\nEnter outgoing links for each node (space-separated). Leave empty if no outgoing links.")
for node in nodes:
    links = input(f"Outgoing links from {node}: ").split()
    for link in links:
        if link not in node_index:
            print(f"Warning: {link} is not a valid node. Skipping.")
            continue
        adj_matrix[node_index[node]][node_index[link]] = 1

while True:
    try:
        d = float(input("\nEnter damping factor (0-1, e.g., 0.85): "))
        if 0 < d < 1:
            break
        else:
            print("Please enter a number between 0 and 1.")
    except ValueError:
        print("Invalid input. Enter a decimal number between 0 and 1.")

epsilon = 0.0001
PR = [1/n]*n
out_degree = [sum(row) for row in adj_matrix]

print("\nPageRank Iterations:\n")
print("Iteration\t" + "\t".join(nodes))
iteration = 0
while True:
    iteration += 1
    new_PR = [0]*n
    for i in range(n):
        rank_sum = 0
        for j in range(n):
            if adj_matrix[j][i] == 1 and out_degree[j] != 0:
                rank_sum += PR[j] / out_degree[j]
        new_PR[i] = (1 - d)/n + d * rank_sum
    print(f"{iteration}\t\t" + "\t".join(f"{x:.4f}" for x in new_PR))
    if all(abs(new_PR[i] - PR[i]) < epsilon for i in range(n)):
        PR = new_PR
        break
    PR = new_PR

ranking = sorted([(nodes[i], PR[i]) for i in range(n)], key=lambda x: x[1], reverse=True)
print("\nFinal Node Ranking:")
for node, _ in ranking:
    print(f"{node} -> ", end="")
print("END")
print("\nOrdered nodes by rank:", " -> ".join([node for node, _ in ranking]))
INPUT & OUTPUT:
Conclusion:
R1 (3)    R2 (5)    R3 (4)    R4 (3)    Total (15)    Sign with Date