
DATTA MEGHE COLLEGE OF ENGINEERING,

Airoli, Navi Mumbai


Department of Computer Engineering
Academic Year: 2025-26 (Term I)

LAB MANUAL
(ACADEMIC RECORD)

NAME OF THE SUBJECT: Data Warehousing and Mining

CLASS: TE

SEMESTER: V
Program Outcomes as defined by NBA (PO)
Engineering Graduates will be able to:

1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an
engineering specialization to the solution of complex engineering problems.

2. Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety, and the cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to
provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with an
understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional
engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of
the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in diverse
teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and
design documentation, make effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the engineering
and management principles and apply these to one’s own work, as a member and leader in a team, to
manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
Institute Vision: To create value-based technocrats to fit in the world of work and research.

Institute Mission: To adapt the best practices for creating competent human beings to work in the world of technology and research.

Department Vision: To provide an intellectually stimulating environment for education, technological excellence in the computer engineering field, and professional training along with human values.
Department Mission:
M1: To promote an educational environment that combines academics with intellectual
curiosity.
M2: To develop human resource with sound knowledge of theory and practical in the
discipline of Computer Engineering and the ability to apply the knowledge to the benefit
of society at large.
M3: To assimilate creative research and new technologies in order to facilitate students to be
a lifelong learner who will contribute positively to the economic well-being of the
nation.
Program Educational Objectives (PEO):
PEO1: To explicate optimal solutions through application of innovative computer
science techniques that aid towards betterment of society.
PEO2: To adapt recent emerging technologies for enhancing their career opportunity
prospects.

PEO3: To effectively communicate and collaborate as a member or leader in a team to manage multidisciplinary projects.
PEO4: To prepare graduates to involve in research, higher studies or to become
entrepreneurs in long run.
Program Specific Outcomes (PSO):
PSO1: To apply basic and advanced computational and logical skills to provide solutions to
computer engineering problems
PSO2: Ability to apply standard practices and strategies in design and development of software
and hardware based systems and adapt to evolutionary changes in computing to meet the
challenges of the future.
PSO3: To develop an approach for lifelong learning and utilize multi-disciplinary knowledge
required for satisfying industry or global requirements.
DATTA MEGHE COLLEGE OF ENGINEERING
Department of Computer Engineering

Course Name: Data Warehousing and Mining (R-19)


Course Code: CSC504
Year of Study: T.E., Semester: V

Course Outcomes

CSC504.1  Understand data warehouse fundamentals and design data warehouse with dimensional modelling and apply OLAP operations.

CSC504.2  Understand data mining principles and perform data preprocessing and visualization.

CSC504.3  Identify appropriate data mining algorithms to solve real world problems.

CSC504.4  Compare and evaluate different data mining techniques like classification, prediction, clustering and association rule mining.

CSC504.5  Describe complex information and social networks with respect to web mining.
DATTA MEGHE COLLEGE OF ENGINEERING
AIROLI, NAVI MUMBAI - 400708

CERTIFICATE

This is to certify that Mr. / Miss ___________________________________________________of

____________ Class ____________________________Roll No. _____________________

Subject __________________________________________ has performed the experiments /

Sheets mentioned in the index, in the premises of this institution.

______________________ ______________ ____________

Practical Incharge Head of Dept. Principal

Date _________________

Examined on:

Examiner 1 _____________________________ Examiner 2 ___________________________


DATTA MEGHE COLLEGE OF ENGINEERING, Airoli, Navi Mumbai
DEPARTMENT OF COMPUTER ENGINEERING
ACADEMIC YEAR: 2025 – 26 (TERM – I)

List of Experiments
Course Name: Data Warehousing and Mining
Course Code: CSC504

Experiment No | Name of the Experiment | CO covered | Page no. | Date | Signature
1  | Case study on building Data Warehouse / Data Mart | CSC504.1 | | |
2  | Implementation of all dimension tables and fact table based on experiment 1 case study | CSC504.1 | | |
3  | To implement OLAP operations: Slice, Dice, Roll up, Drill down, and Pivot based on experiment 1 case study | CSC504.1 | | |
4  | Implementation of Bayesian Classification Algorithm | CSC504.3 | | |
5  | Implementation of Data Discretization and Visualization | CSC504.2 | | |
6  | Perform data preprocessing task and demonstrate Classification, Clustering, Association algorithm on data sets using data mining tool (WEKA / R tool) | CSC504.2, CSC504.3, CSC504.4 | | |
7  | To implement Clustering Algorithm (K-means) | CSC504.4 | | |
8  | Implementation of any one Hierarchical Clustering method | CSC504.4 | | |
9  | Implementation of Association rule Mining (Apriori algorithm) | CSC504.4 | | |
10 | Implementation of Page Rank Algorithm | CSC504.5 | | |
EXPERIMENT NO. 1

Aim: One case study on building a Data Warehouse / Data Mart. Write a detailed problem statement and design the dimensional model (creation of star and snowflake schema).

Software used: Any online drawing tool.

Theory:

What is Data Warehouse Architecture?


• A data warehouse architecture defines the design of the data warehouse that helps streamline the collection, storage, and utilization of data gathered from disparate sources for analytical purposes. A well-developed data warehouse architecture decides the efficiency of collecting raw data and transforming it to make it valuable for business processes.

• Layers of a Data Warehouse Architecture
• While there can be various layers in a data warehouse architecture, there are a few standard ones that are responsible for the efficient functioning of the data warehouse software.

Data Mart
• A data mart is oriented to a specific purpose or major data subject that may be distributed to support business needs.
• It is a subset of the Data Warehouse / data resource.

Star Schema:
• It represents the multidimensional model.
• In this model (dimensional modeling) the data is organized into facts and dimensions.
• The star model is the underlying structure for a dimensional model.
• It has one broad central table (fact table) and a set of smaller tables (dimensions) arranged in a star design.

Snowflake Schema
• A snowflake schema is a multi-dimensional data model that is an extension of a star schema, where dimension tables are broken down into sub-dimensions.

Case Study:
Problem Statement:
An anime recommendation platform wants to analyze user viewing preferences to build a more
personalized recommendation system. They have access to user ratings and detailed information about
anime shows. The goal is to identify viewing trends, most-watched genres, popular anime, and user
behavior patterns over time. To achieve this, a data warehouse is required to organize and process large
volumes of anime rating data efficiently.

Analysis to be done

How the above analysis improves the business (i.e., addresses the above problem definition)

Design Information Package diagram

Details of Dimension tables:

dim_anime
– anime_key
– anime_id
– name
– genre
– type
– episodes
– average_rating
– members

dim_users
– user_key
– user_id

dim_date
– date_key
– full_date
– year
– quarter
– month
– day
– day_of_week

Details of Fact table:

fact_rating
– rating_id [PK]
– user_key [FK]
– anime_key [FK]
– date_key [FK]
– type_key [FK]
– user_rating

Draw and attach Star Schema


Draw and attach Snowflake Schema (if applicable to your project):
– In the Snowflake Schema, Dimensions are normalized:

– dim_anime -> dim_type


– anime_key | anime_id | name | episodes | average_rating | members | type_key
– type_key | type_name

Conclusion:

R1 (3)   R2 (5)   R3 (4)   R4 (3)   Total (15)   Sign with Date
EXPERIMENT NO: 02

Aim: Implementation of all dimension tables and the fact table based on the experiment 1 case study.

Software used: MySQL

Theory:

• Implement each dimension table and the fact table of the star / snowflake schema using the CREATE TABLE command.

• Insert 20 tuples in each table using the INSERT command.

• Attach screenshots of the data populated in every dimension table and the fact table (at least 20 entries in each table).

Table: dim_anime

Purpose: Describes the "what" of the data. It contains details about each anime, such as its name,
genre, and type.

My Sql Command:
CREATE TABLE dim_anime (
anime_key INT PRIMARY KEY, anime_id INT, name VARCHAR(255), genre VARCHAR(255), type VARCHAR(50), episodes INT,
average_rating DECIMAL(4,2), members INT);

Insert command:
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (1, 32281, 'Kimi
no Na wa.', 'Drama, Romance, School, Supernatural', 'Movie', 1, 9.37, 200630);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (2, 5114,
'Fullmetal Alchemist: Brotherhood', 'Action, Adventure, Drama, Fantasy, Magic, Military, Shounen', 'TV', 64, 9.26, 793665);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (3, 28977,
'Gintama°', 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen', 'TV', 51, 9.25, 114262);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (4, 9253,
'Steins;Gate', 'Sci-Fi, Thriller', 'TV', 24, 9.17, 673572);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (5, 9969,
'Gintama''', 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen', 'TV', 51, 9.16, 151266);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (6, 32935,
'Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou', 'Comedy, Drama, School, Shounen, Sports', 'TV', 10, 9.15, 93351);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (7, 11061, 'Hunter
x Hunter (2011)', 'Action, Adventure, Shounen, Super Power', 'TV', 148, 9.13, 425875);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (8, 820, 'Ginga
Eiyuu Densetsu', 'Drama, Military, Sci-Fi, Space', 'OVA', 110, 9.11, 80679);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (9, 15335,
'Gintama Movie: Kanketsu-hen - Yorozuya yo Eien Nare', 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen', 'Movie', 1,
9.10, 72534);
INSERT INTO dim_anime (anime_key, anime_id, name, genre, type, episodes, average_rating, members) VALUES (10, 15417,
'Gintama'': Enchousen', 'Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen', 'TV', 13, 9.11, 81109);
Anime Key | Anime ID | Name | Genre | Type | Episodes | Average Rating | Members
1  | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200,630
2  | 5114  | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Military, Shounen | TV | 64 | 9.26 | 793,665
3  | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | TV | 51 | 9.25 | 114,262
4  | 9253  | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673,572
5  | 9969  | Gintama' | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | TV | 51 | 9.16 | 151,266
6  | 32935 | Haikyuu!!: Karasuno Koukou VS Shiratorizawa Gakuen Koukou | Comedy, Drama, School, Shounen, Sports | TV | 10 | 9.15 | 93,351
7  | 11061 | Hunter x Hunter (2011) | Action, Adventure, Shounen, Super Power | TV | 148 | 9.13 | 425,875
8  | 820   | Ginga Eiyuu Densetsu | Drama, Military, Sci-Fi, Space | OVA | 110 | 9.11 | 80,679
9  | 15335 | Gintama Movie: Kanketsu-hen - Yorozuya yo Eien Nare | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | Movie | 1 | 9.10 | 72,534
10 | 15417 | Gintama': Enchousen | Action, Comedy, Historical, Parody, Samurai, Sci-Fi, Shounen | TV | 13 | 9.11 | 81,109

Table: dim_users

Purpose: Describes the "who". It holds information about each user who provided a rating.

My Sql Command:

CREATE TABLE dim_users (
user_key INT PRIMARY KEY, user_id INT);

Insert command:
INSERT INTO dim_users (user_key, user_id) VALUES (1, 101);
INSERT INTO dim_users (user_key, user_id) VALUES (2, 102);
INSERT INTO dim_users (user_key, user_id) VALUES (3, 103);
INSERT INTO dim_users (user_key, user_id) VALUES (4, 104);
INSERT INTO dim_users (user_key, user_id) VALUES (5, 105);
INSERT INTO dim_users (user_key, user_id) VALUES (6, 106);
INSERT INTO dim_users (user_key, user_id) VALUES (7, 107);
INSERT INTO dim_users (user_key, user_id) VALUES (8, 108);
INSERT INTO dim_users (user_key, user_id) VALUES (9, 109);
INSERT INTO dim_users (user_key, user_id) VALUES (10, 110);

UserKey User ID

1 101

2 102

3 103

4 104

5 105

6 106

7 107

8 108

9 109

10 110

Table: dim_type

Purpose: Provides context about the "type" of the anime.

My Sql Command:

CREATE TABLE dim_type (


type_key INT PRIMARY KEY, type_name VARCHAR(50));

Insert command:

INSERT INTO dim_type (type_key, type_name) VALUES (1, 'TV');


INSERT INTO dim_type (type_key, type_name) VALUES (2, 'Movie');
INSERT INTO dim_type (type_key, type_name) VALUES (3, 'OVA');
INSERT INTO dim_type (type_key, type_name) VALUES (4, 'Special');
INSERT INTO dim_type (type_key, type_name) VALUES (5, 'ONA');
INSERT INTO dim_type (type_key, type_name) VALUES (6, 'Music');

Type Key Type Name

1 TV

2 Movie

3 OVA

4 Special

5 ONA

6 Music

Table: dim_date

Purpose: Describes the "when". It contains attributes related to the time of the rating, such as the year,
quarter, and day of the week.

My Sql Command:

CREATE TABLE dim_date (


date_key INT PRIMARY KEY, full_date DATE, year INT, quarter INT, month INT, day INT, day_of_week VARCHAR(10));

Insert command:

INSERT INTO dim_date (date_key, full_date, year, quarter, month, day, day_of_week) VALUES (20230105, '2023-01-05', 2023, 1, 1,
5, 'Thursday');
INSERT INTO dim_date (date_key, full_date, year, quarter, month, day, day_of_week) VALUES (20230115, '2023-01-15', 2023, 1, 1,
15, 'Sunday');
INSERT INTO dim_date (date_key, full_date, year, quarter, month, day, day_of_week) VALUES (20230210, '2023-02-10', 2023, 1, 2,
10, 'Friday');
INSERT INTO dim_date (date_key, full_date, year, quarter, month, day, day_of_week) VALUES (20230320, '2023-03-20', 2023, 1, 3,
20, 'Monday');
INSERT INTO dim_date (date_key, full_date, year, quarter, month, day, day_of_week) VALUES (20230401, '2023-04-01', 2023, 2, 4,
1, 'Saturday');
Date Key Full Date Year Quarter Month Day Day of Week

20230105 2023-01-05 2023 1 1 5 Thursday

20230115 2023-01-15 2023 1 1 15 Sunday

20230210 2023-02-10 2023 1 2 10 Friday

20230320 2023-03-20 2023 1 3 20 Monday

20230401 2023-04-01 2023 2 4 1 Saturday

20230512 2023-05-12 2023 2 5 12 Friday

20230625 2023-06-25 2023 2 6 25 Sunday

20230707 2023-07-07 2023 3 7 7 Friday

20230819 2023-08-19 2023 3 8 19 Saturday

20230930 2023-09-30 2023 3 9 30 Saturday
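
Table: fact_rating

Purpose: Stores the measurable fact (the user rating) together with foreign keys to all dimension tables. A minimal sketch is given below, following the fact_rating design from Experiment 1; the sample rows are illustrative values that reference the dimension keys inserted above.

My Sql Command:

CREATE TABLE fact_rating (
rating_id INT PRIMARY KEY, user_key INT, anime_key INT, date_key INT, type_key INT, user_rating DECIMAL(4,2),
FOREIGN KEY (user_key) REFERENCES dim_users(user_key),
FOREIGN KEY (anime_key) REFERENCES dim_anime(anime_key),
FOREIGN KEY (date_key) REFERENCES dim_date(date_key),
FOREIGN KEY (type_key) REFERENCES dim_type(type_key));

Insert command (illustrative rows; populate at least 20 in the same way):
INSERT INTO fact_rating (rating_id, user_key, anime_key, date_key, type_key, user_rating) VALUES (1, 1, 1, 20230105, 2, 9.00);
INSERT INTO fact_rating (rating_id, user_key, anime_key, date_key, type_key, user_rating) VALUES (2, 2, 2, 20230115, 1, 8.50);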

Conclusion:

R1 (3)   R2 (5)   R3 (4)   R4 (3)   Total (15)   Sign with Date
EXPERIMENT NO: 03

Aim: Implementation of OLAP operations: Slice, Dice, Roll up, Drill down and Pivot based on the Experiment 1 case study.

Software used: MySQL

Theory:

1. Rollup (drill-up):
ROLLUP is used in tasks involving subtotals. It creates subtotals at any level of aggregation
needed, from the most detailed up to a grand total i.e. climbing up a concept hierarchy for the
dimension such as time or geography.
Example: A query could involve a ROLLUP of year > month > day or country > state > city.

QUESTION: Using the fact_rating table (which stores user ratings) and the dim_date table (which contains year, quarter, and month information), write an SQL query that uses ROLLUP to compute the average user rating with subtotals by year, quarter, and month, and a grand total.

MYSQL Query And OUTPUT:
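
A possible query for this task is sketched below, assuming the fact_rating and dim_date tables created in Experiment 2 (the actual output depends on the rows inserted):

-- ROLLUP: average rating with subtotals per year, quarter, month and a grand total
SELECT d.year, d.quarter, d.month, AVG(f.user_rating) AS avg_rating
FROM fact_rating f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.quarter, d.month WITH ROLLUP;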


2. Drill down (Roll down):
This is the reverse of the ROLLUP operation discussed above. The data is aggregated from a higher-level summary to a lower-level summary / detailed data.

QUESTION:

Write an SQL query to display the year, month, and day of each rating; calculate the average user rating for each day; and arrange the results in chronological order by year, month, and day.

MYSQL Query And OUTPUT:
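
A possible drill-down query, again assuming the Experiment 2 tables; it moves from the monthly summary down to daily detail:

-- Drill down: average rating per day, in chronological order
SELECT d.year, d.month, d.day, AVG(f.user_rating) AS avg_rating
FROM fact_rating f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.month, d.day
ORDER BY d.year, d.month, d.day;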

3. Slicing:
A slice in a multidimensional array is a column of data corresponding to a single value for one or more members of the dimension. It helps the user to visualize and gather the information specific to a dimension.

QUESTION:

Write an SQL query to:

1. Display the type name of the content.


2. Calculate the average user rating for the content type 'Movie'.

3. Show the result grouped by type name.

MYSQL Query And OUTPUT:
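
A possible slice query, assuming the Experiment 2 tables; it fixes one dimension value (type = 'Movie'):

-- Slice: average rating for the 'Movie' type only
SELECT t.type_name, AVG(f.user_rating) AS avg_rating
FROM fact_rating f
JOIN dim_type t ON f.type_key = t.type_key
WHERE t.type_name = 'Movie'
GROUP BY t.type_name;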

4. Dicing:
Dicing is similar to slicing, but it works a little bit differently. When one thinks of slicing, filtering is done to focus on a particular attribute. Dicing, on the other hand, is more of a zoom feature that selects a subset over all the dimensions, but for specific values of the dimension.

QUESTION:

Write an SQL query to:

1. Display the year, quarter, and type name of the content.

2. Calculate the average user rating for content of type 'TV'.

3. Restrict the results to the year 2023 and Quarter 1 only.

4. Group the results by year, quarter, and type name.

MYSQL Query And Output:
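
A possible dice query, assuming the Experiment 2 tables; it selects a sub-cube by restricting several dimensions at once (type 'TV', year 2023, quarter 1):

-- Dice: average rating for TV anime in 2023 Q1
SELECT d.year, d.quarter, t.type_name, AVG(f.user_rating) AS avg_rating
FROM fact_rating f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_type t ON f.type_key = t.type_key
WHERE t.type_name = 'TV' AND d.year = 2023 AND d.quarter = 1
GROUP BY d.year, d.quarter, t.type_name;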


5. Pivot:
Pivot, otherwise known as Rotate, changes the dimensional orientation of the cube, i.e. rotates the data axes to view the data from different perspectives. Pivot groups data with different dimensions. The cubes below show a 2D representation of Pivot.

QUESTION:

Write an SQL query to:

1. Display the content type name.

2. Show the average user rating for each content type, broken down month-wise (Jan to Sep).

3. Arrange the months as separate columns (Jan, Feb, Mar, …, Sep).

4. Round the average ratings to two decimal places.

5. Group the results by content type name.

MYSQL Query And OUTPUT:
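
A possible pivot query using conditional aggregation, assuming the Experiment 2 tables; each month becomes a separate column:

-- Pivot: month-wise average rating per content type (Jan to Sep)
SELECT t.type_name,
  ROUND(AVG(CASE WHEN d.month = 1 THEN f.user_rating END), 2) AS Jan,
  ROUND(AVG(CASE WHEN d.month = 2 THEN f.user_rating END), 2) AS Feb,
  ROUND(AVG(CASE WHEN d.month = 3 THEN f.user_rating END), 2) AS Mar,
  -- ... the columns for Apr to Aug follow the same pattern ...
  ROUND(AVG(CASE WHEN d.month = 9 THEN f.user_rating END), 2) AS Sep
FROM fact_rating f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_type t ON f.type_key = t.type_key
GROUP BY t.type_name;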


Conclusion:

R1 (3)   R2 (5)   R3 (4)   R4 (3)   Total (15)   Sign with Date
EXPERIMENT NO: 04

Aim: Implementation of Bayesian Classification Algorithm.

Software used: Java/C/Python

Theory:

It is a classification technique based on Bayes’ Theorem with an assumption of


independence among predictors. A Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature. For example,
a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter.
Even if these features depend on each other or upon the existence of the other features,
all of these properties independently contribute to the probability that this fruit is an apple
and that is why it is known as ‘Naive’.

Naive Bayes model is easy to build and particularly useful for very large data sets.
Along with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.

Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c),
P(x) and P(x|c).
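
In symbols, Bayes' theorem can be written as:

P(c|x) = [ P(x|c) × P(c) ] / P(x)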

Consider a training data set of weather and the corresponding target variable 'Play' (suggesting the possibility of playing). Now, we need to classify whether players will play or not based on the weather condition.

Step 1: Convert the data set into a frequency table

Step 2: Create Likelihood table by finding the probabilities like Overcast probability =
0.29 and probability of playing is 0.64.
• P(c|x) is the posterior probability of class (c, target) given predictor (x,attributes).
• P(c) is the prior probability ofclass.
• P(x|c) is the likelihood which is the probability of predictor givenclass.
• P(x) is the prior probability ofpredictor.
• Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of prediction.
• Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using above discussed method of posterior probability.

• P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P(Sunny)


• Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
• Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.

Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.

Advantages:
• It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
• When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
• It performs well in the case of categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).

Disadvantages:
• If categorical variable has a category (in test data set), which was not observed in training data
set, then model will assign a 0 (zero) probability and will be unable to make a prediction. This
is often known as “Zero Frequency”. To solve this, we can use the smoothing technique.
One of the simplest smoothing techniques is called Laplace estimation.
• Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is
almost impossible that we get a set of predictors which are completely independent.

Applications of Naive Bayes Algorithms


• Real time Prediction: Naive Bayes is an eager learning classifier and it is fast. Thus, it could be used for making predictions in real time.
• Multi class Prediction: This algorithm is also well known for its multi-class prediction feature. Here we can predict the probability of multiple classes of the target variable.
• Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers, mostly used in text classification (due to better results in multi-class problems and the independence rule), have a higher success rate as compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
• Recommendation System: A Naive Bayes Classifier and Collaborative Filtering together build a Recommendation System that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

PROGRAM:

import java.util.*;

public class AnimeNaiveBayes {

static class Anime {


String type;
String episodes;
String ratingClass;

Anime(String type, String episodes, String ratingClass) {


this.type = type;
this.episodes = episodes;
this.ratingClass = ratingClass;
}
}

static List<Anime> dataset = new ArrayList<>();

// --- Main Program ---


public static void main(String[] args) {

dataset.add(new Anime("Movie", "1", "High"));
dataset.add(new Anime("TV", "64", "High"));
dataset.add(new Anime("TV", "51", "High"));
dataset.add(new Anime("TV", "24", "Low"));
dataset.add(new Anime("TV", "51", "Low"));
dataset.add(new Anime("TV", "10", "Low"));
dataset.add(new Anime("TV", "148", "Low"));
dataset.add(new Anime("OVA", "110", "Low"));
dataset.add(new Anime("Movie", "1", "Low"));
dataset.add(new Anime("TV", "13", "Low"));

int totalHigh = (int) dataset.stream().filter(a ->


a.ratingClass.equals("High")).count();
int totalLow = (int) dataset.stream().filter(a ->
a.ratingClass.equals("Low")).count();

double pHigh = (double) totalHigh / dataset.size();


double pLow = (double) totalLow / dataset.size();

Scanner sc = new Scanner(System.in);


System.out.println("Enter Type (TV/Movie/OVA): ");
String qType = sc.nextLine();
System.out.println("Enter Episodes: ");
String qEpisodes = sc.nextLine();

double typeHigh = calcLikelihood("type", qType, "High");


double typeLow = calcLikelihood("type", qType, "Low");
double epHigh = calcLikelihood("episodes", qEpisodes, "High");
double epLow = calcLikelihood("episodes", qEpisodes, "Low");

double posteriorHigh = typeHigh * epHigh * pHigh;


double posteriorLow = typeLow * epLow * pLow;

System.out.println("\nP(High | input) = " + posteriorHigh);


System.out.println("P(Low | input) = " + posteriorLow);

if (posteriorHigh > posteriorLow)


System.out.println("Prediction: HIGH rated anime");
else
System.out.println("Prediction: LOW rated anime");
}

private static double calcLikelihood(String attr, String value, String


target) {
int countAttr = 0;
int countClass = 0;

for (Anime a : dataset) {


if (a.ratingClass.equals(target)) {
countClass++;
if (attr.equals("type") && a.type.equals(value)) countAttr++;
if (attr.equals("episodes") && a.episodes.equals(value))
countAttr++;
}
}
if (countClass == 0) return 0;
return (double) countAttr / countClass;
}
}

INPUT & OUTPUT:

Conclusion:

R1 (3)   R2 (5)   R3 (4)   R4 (3)   Total (15)   Sign with Date
EXPERIMENT NO: 05

Aim: Implementation of Data Discretization and Visualization

Software used: Java/C/Python

Theory: Data discretization refers to a method of converting a huge number of data values into smaller ones so that the evaluation and management of data become easy. In other words, data discretization is a method of converting attribute values of continuous data into a finite set of intervals with minimum data loss. There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization. Supervised discretization refers to a method in which the class data is used. Unsupervised discretization refers to a method that depends on the way in which the operation proceeds, i.e., it works on a top-down splitting strategy or a bottom-up merging strategy.

Example

Suppose we have an attribute of Age with the given values

Age 1,5,9,4,7,11,14,17,13,18, 19,31,33,36,42,44,46,70,74,78,77

Table before Discretization

Attribute              Age           Age                  Age                  Age
Values                 1,5,4,9,7     11,14,17,13,18,19    31,33,36,42,44,46    70,74,77,78
After Discretization   Child         Young                Mature               Old

Some techniques of data discretization:

Histogram analysis

Histogram refers to a plot used to represent the underlying frequency distribution of a


continuous data set. Histogram assists the data inspection for data distribution. For example,
Outliers, skewness representation, normal distribution representation, etc.
Binning

Binning refers to a data smoothing technique that helps to group a huge number of continuous values into smaller values. This technique can also be used for data discretization and the development of a concept hierarchy.

Cluster Analysis

Cluster analysis is a form of data discretization. A clustering algorithm is executed by dividing


the values of x numbers into clusters to isolate a computational feature of x.

Data discretization using decision tree analysis

In discretization using decision tree analysis, a top-down splitting technique is used. It is done through a supervised procedure. In numeric attribute discretization, first you need to select the attribute that has the least entropy, and then you need to run it with the help of a recursive process. The recursive process divides it into various discretized disjoint intervals, from top to bottom, using the same splitting criterion.

Data discretization using correlation analysis

Discretizing data by linear regression technique, you can get the best neighbouring interval,
and then the large intervals are combined to develop a larger overlap to form the final 20
overlapping intervals. It is a supervised procedure.

Data visualization

Data visualization is actually a set of data points and information that are represented
graphically to make it easy and quick for user to understand. Data visualization is good if it
has a clear meaning, purpose, and is very easy to interpret, without requiring context. Tools
of data visualization provide an accessible way to see and understand trends, outliers, and
patterns in data by using visual effects or elements such as a chart, graphs, and maps.

Data Visualization Techniques:

• Histogram

A histogram is a graphical display of data using bars of different heights. In a


histogram, each bar groups numbers into ranges. Taller bars show that more data
falls in that range. A histogram displays the shape and spread of continuous sample
data.
• Boxplots

A box plot is a graph that gives you a good indication of how the values in the
data are spread out. Although box plots may seem primitive in comparison to
a histogram or density plot, they have the advantage of taking up less space,
which is useful when comparing distributions between many groups or datasets.

[Figure: Boxplot of members by anime type (Movie, TV, OVA)]

• Scatter plots

Scatter plots are useful to display the relative density of two dimensions of data.
Well-designed ones quantify and correlate complex sets of data in an easy-to-read
manner. Often, these charts are used to discover trends and data, as much as they
are to visualize the data.

[Figure: Scatter plot of Episodes vs Rating]

• Matrix plots

These are the special types of plots that use two-dimensional matrix data for
visualization. It is difficult to analyze and generate patterns from matrix data because
of its large dimensions. So, this makes the process easier by providing color coding
to matrix data.

• Parallel Coordinates

Parallel coordinates is a visualization technique used to plot individual data


elements across many performance measures. Each of the measures corresponds
to a vertical axis and each data element is displayed as a series of connected
points along the measure/axes.
• Star plots

The star plot (Chambers 1983) is a method of displaying multivariate data. Each
star represents a single observation. Typically, star plots are generated in a multi-
plot format with many stars on each page and each star representing one
observation.Star plots are used to examine the relative values for a single data
point.
• Dygraphs

dygraphs is an open source JavaScript library that produces interactive, zoomable charts of time series. It is designed to display dense data sets and enable users to explore and interpret them.
• Zing chart

ZingChart is a declarative JavaScript charting library with 50+ built-in chart types and modules for data visualization projects. It can create animated and interactive charts with hundreds of thousands of data records.
• Instant Atlas

The InstantAtlas team prepare and manage large statistical indicator data sets and deliver community information systems, local observatories and knowledge hub websites for clients as fully managed services, so that you can build your own services and sites using ArcGIS Online and WordPress.

• Timeline

A timeline is a great data visualization technique when you wish to show data in chronological order and highlight important points in time. To create a timeline, simply lay out your data points along a PowerPoint shape, and mark the data off to visually see your overall project.

PROGRAM:
import math

print("\n\t\tSmoothing by Bin Means\n")

print("\nEnter the data:")
x = list(map(float, input().split()))
print("\nEnter the number of bins:")
bi = int(input())

# Equal-frequency binning: the values must be sorted before partitioning
x_sorted = sorted(x)
num_of_data_in_each_bin = int(math.ceil(len(x) / bi))

# Partition the sorted values into bins of (at most) equal size
bins = [x_sorted[i:i + num_of_data_in_each_bin]
        for i in range(0, len(x_sorted), num_of_data_in_each_bin)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean
bin_means = [round(sum(b) / len(b), 3) for b in bins]

print("\nNumber of data in each bin:")
print(str(num_of_data_in_each_bin) + "\n")

print("\nPartitioning elements: ")
for i, b in enumerate(bins, start=1):
    print("Bin " + str(i) + ": ", " ".join(str(round(v, 3)) for v in b), "\n")

print("\nSmoothing by bin means: ")
for i, b in enumerate(bins, start=1):
    print("Bin " + str(i) + ": ", " ".join(str(bin_means[i - 1]) for _ in b), "\n")

Discretization Histogram Example:

import matplotlib.pyplot as plt

# Example anime ratings data


ratings = [9.10, 9.12, 9.13, 9.14, 9.15, 9.20, 9.22, 9.25, 9.35]

plt.hist(ratings, bins=5, color="#f0ad4e", edgecolor="black")  # 5 bins, as in the sample figure
plt.title("Histogram of Anime Ratings")
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.savefig("anime_ratings_histogram.png")
plt.show()

Output:

Conclusion:

R1 (3)   R2 (5)   R3 (4)   R4 (3)   Total (15)   Sign with Date
EXPERIMENT NO: 06

Aim: Perform data preprocessing tasks and demonstrate Classification, Clustering, and Association algorithms on data sets using a data mining tool (WEKA / R tool).

Software used: WEKA


Theory:

Waikato Environment for Knowledge Analysis (Weka) is a popular suite of machine learning
software written in Java, developed at the University of Waikato, New Zealand. It is free
software licensed under the GNU General Public License.

Weka is a workbench that contains a collection of visualization tools and algorithms for
data analysis and predictive modeling, together with graphical user interfaces for easy access
to these functions.

This original version was primarily designed as a tool for analyzing data from agricultural
domains, but the more recent fully Java-based version (Weka 3), for which development
started in 1997, is now used in many different application areas, in particular for educational
purposes and research.

Advantages of Weka include:

• Free availability under the GNU General Public License.Portability, since it is fully
implemented in the Java programming language and thus runs on almost any modern
computing platform.
• A comprehensive collection of data preprocessing and modeling techniques.

• Ease of use due to its graphical user interfaces.

Weka supports several standard data mining tasks, more specifically, data preprocessing,
clustering, classification, regression, visualization, and feature selection. All of Weka's
techniques are predicated on the assumption that the data is available as one flat file or
relation, where each data point is described by a fixed number of attributes (normally,
numeric or nominal attributes, but some other attribute types are also supported). Weka
provides access to SQL databases using Java Database Connectivity and can process the
result returned by a database query. It is not capable of multi-relational data mining, but
there is separate software for converting a collection of linked database tables into a single
table that is suitable for processing using Weka.

Weka's main user interface is the Explorer, but essentially the same functionality can be
accessed through the component-based Knowledge Flow interface and from thecommand
line. There is also the Experimenter, which allows the systematic comparison of the predictive
performance of Weka's machine learning algorithms on a collection of datasets.

The Explorer interface features several panels providing access to the main components of
the workbench:

The Preprocess panel has facilities for importing data from a database, a comma-
separated values (CSV) file, etc., and for preprocessing this data using a so-called filtering
algorithm. These filters can be used to transform the data (e.g., turning numeric
attributes into discrete ones) and make it possible to delete instances and attributes
according to specific criteria.

The Classify panel enables applying classification and regression algorithms

(indiscriminately called classifiers in Weka) to the resulting dataset, to estimate

the accuracy of the resulting predictive model, and to visualize erroneous

predictions, receiver operating characteristic (ROC) curves, etc., or the model itself

(if the model is amenable to visualization like, e.g., a decision tree).

The Associate panel provides access to association rule learners that attempt to
identify all important interrelationships between attributes in the data.

The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple
k-means algorithm. There is also an implementation of the expectation maximization
algorithm for learning a mixture of normal distributions.

The Select attributes panel provides algorithms for identifying the most predictive
attributes in a dataset.

The Visualize panel shows a scatter plot matrix, where individual scatter plots can be
selected and enlarged, and analyzed further using various selection operators.

Preprocessing in WEKA

Selecting or Filtering Attributes

In the "Filter" panel, click on the "Choose" button. This will show a popup window
with a list available filters. Scroll down the list and select the
"weka.filters.unsupervised.attribute.Remove" filter as shown in Figure.
Classification using WEKA:

This experiment illustrates the use of the Naïve Bayes classifier in WEKA. Consider the sample data set "employee" available in ARFF format. This document assumes that appropriate data preprocessing has been performed.

Steps involved in this experiment:

1. Begin the experiment by loading the data (employee.arff) into weka.

Step 2: Next we select the "Classify" tab and click the "Choose" button to select the "Naïve Bayes" classifier.

Step 3: Now specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example, we accept the default values; this default version does perform some pruning but does not perform error pruning.

Step 4: Under the "Test options" in the main panel, select 10-fold cross-validation as our evaluation approach. Since we don't have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.

Step 5: Now click "Start" to generate the model. The ASCII version of the tree as well as the evaluation statistics will appear in the right panel when the model construction is complete.

Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed (either in preprocessing or in selecting the parameters for classification).

Step 7: WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting "Visualize tree" from the pop-up menu.

Step 8: Use the model to classify the new instances.

Step 9: In the main panel, under "Test options", click the "Supplied test set" radio button and then click the "Set" button. This will show a pop-up window which will allow you to open the file containing the test instances.

Data set employee.arff:

@relation employee

@attribute age {25, 27, 28, 29, 30, 35, 48}
@attribute salary {10k, 15k, 17k, 20k, 25k, 30k, 35k, 32k, 34k}
@attribute performance {good, avg, poor}

@data

25, 10k, poor

27, 15k, poor

27, 17k, poor

28, 17k, poor

29, 20k, avg

30, 25k, avg

29, 25k, avg

30, 20k, avg

35, 32k, good


48, 34k, good

48, 32k, good

The following screenshot shows the classification results that were generated when the Naive Bayes algorithm is applied to the given dataset.
Clustering Using WEKA:

This experiment illustrates the use of simple k-mean clustering with Weka explorer.
The sample data set used for this example is based on the iris data available in ARFF format.
This document assumes that appropriate preprocessing has been performed. This iris dataset
includes 150 instances.

Steps involved in this Experiment

Step 1: Run the Weka explorer and load the data file iris.arff in preprocessing interface.

Step 2: In order to perform clustering select the ‘cluster’ tab in the explorer and click on the

choose button. This step results in a dropdown list of available clustering algorithms.

Step 3 : In this case we select ‘simple k-means’.

Step 4: Next click in text button to the right of the choose button to get popup window shown
in the screenshots. In this window we enter six on the number of clusters and we leave the
value of the seed on as it is. The seed value is used in generating a random number which is
used for making the internal assignments of instances of clusters.

Step 5: Once the options have been specified, we run the clustering algorithm. In the 'Cluster mode' panel, make sure that the 'Use training set' option is selected, and then click the 'Start' button. This process and the resulting window are shown in the following screenshots.

Step 6: The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to the different clusters. Here the cluster centroids are the mean vectors of each cluster, and they can be used to characterize the clusters. For example, the centroid of cluster 1 (class Iris-versicolor) shows a mean sepal length of 5.4706, sepal width of 2.4765, petal width of 1.1294, and petal length of 3.7941.

Step 7: Another way of understanding the characteristics of each cluster is through visualization. To do this, right-click the result set in the result list panel and select 'Visualize cluster assignments'.

The following screenshot shows the clustering rules that were generated when simple k
means algorithm is applied on the given dataset.
Interpretation of the above visualization

From the above visualization, we can understand the distribution of sepal length
and petal length in each cluster. For instance, for each cluster is dominated by petal
length. In this case by changing the color dimension to other attributes we can see
their distribution with in each of the cluster.

Step 8: We can also save the resulting dataset, which includes each instance along with its assigned cluster. To do so, click the Save button in the visualization window and save the result as iris-k-means. The top portion of this file is shown in the following figure.
Association Rule Mining in WEKA:

This experiment illustrates some of the basic elements of association rule mining using WEKA. The sample dataset used for this example is test.arff.

Step 1: Open the data file in Weka Explorer. It is presumed that the required data fields have been discretized. In this example it is the age attribute.

Step 2: Clicking on the Associate tab will bring up the interface for the association rule algorithm.

Step 3: Use the Apriori algorithm.

Step 4: In order to change the parameters for the run (e.g., support, confidence, etc.), we click on the text box immediately to the right of the Choose button.

Dataset test.arff

@relation test

@attribute admissionyear {2005,2006,2007,2008,2009,2010}

@attribute course {cse,mech,it,ece}

@data

2005, cse

2005, it

2005, cse

2006, mech

2006, it

2006, ece

2007, it

2007, cse

2008, it
2008, cse

2009, it

2009, ece

The following screenshot shows the association rules that were generated when the Apriori algorithm is applied to the given dataset.

Implementation on case study:

1. Preprocessing data
2. Classification
3. Clustering
4. Association

Conclusion:

R1 (3)   R2 (5)   R3 (4)   R4 (3)   Total (15)   Sign with Date
EXPERIMENT NO: 07

Aim: To implement a Clustering Algorithm (K-means).

Software used: Java/C/Python

Theory:

Clustering is the process of grouping the data into classes or clusters, so that objects within a
cluster have high similarity in comparison to one another but are very dissimilar to objects in other
clusters. Dissimilarities are assessed based on the attribute values describing the objects. Often,
distance measures are used. Clustering has its roots in many areas, including data mining, statistics,
biology, and machine learning.

Clustering is also called data segmentation in some applications because


clustering partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection, where outliers (values that are “far
away” from any cluster) may be more interesting than common cases. Applications of
outlier detection include the detection of credit card fraud and the monitoring of
criminal activities in electronic commerce

Partitioning Methods

Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion,

such as a dissimilarity function based on distance, so that the objects within a


cluster are “similar,” whereas the objects of different clusters are “dissimilar” in
terms of the data set attributes.
Centroid-Based Technique: The k-Means Method

The k-means algorithm takes the input parameter, k, and partitions a set of n objects
into k clusters so that the resulting intracluster similarity is high but the intercluster
similarity is low. Cluster similarity is measured in regard to the mean value of the
objects in a cluster, which can be viewed as the cluster’s centroid or center of gravity.
Advantages

• Easy to implement

• With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).

• K-Means may produce tighter clusters than hierarchical clustering.

• An instance can change cluster (move to another cluster) when the centroids are re-computed.

Disadvantages

• Difficult to predict the number of clusters (K-value)

• Initial seeds have a strong impact on the final results

• The order of the data has an impact on the final results

• Sensitive to scale: rescaling your datasets (normalization or standardization) will completely change results

Applications:

The K-means clustering algorithm is used to find groups which have not been explicitly
labeled in the data. This can be used to confirm business assumptions about what types of
groups exist or to identify unknown groups in complex data sets. Once the algorithm has been
run and the groups are defined, any new data can be easily assigned to the correct group.

This is a versatile algorithm that can be used for any type of grouping. Some
examples of use cases are:

Behavioral segmentation:

o Segment by purchase history

o Segment by activities on application, website, or platform

o Define personas based on interests

o Create profiles based on activity monitoring


Inventory categorization:

o Group inventory by sales activity

o Group inventory by manufacturing metrics


Sorting sensor measurements:

o Detect activity types in motion sensors

o Group images
o Separate audio
o Identify groups in health monitoring
Detecting bots or anomalies:
o Separate valid activity groups from bots

o Group valid activity to clean up outlier detection

PROGRAM:
import java.util.Scanner;

public class KMeansClustering {

static final int MAX_POINTS = 100;


static final int MAX_CLUSTERS = 10;

// Function to calculate Euclidean distance


static double distance(double x1, double y1, double x2, double y2) {
return Math.sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2));
}

public static void main(String[] args) {


Scanner sc = new Scanner(System.in);

int n, k, i, j, iter;
int[] cluster = new int[MAX_POINTS];
double[] x = new double[MAX_POINTS];
double[] y = new double[MAX_POINTS];
double[] centroidX = new double[MAX_CLUSTERS];
double[] centroidY = new double[MAX_CLUSTERS];
double[] newCentroidX = new double[MAX_CLUSTERS];
double[] newCentroidY = new double[MAX_CLUSTERS];
int[] count = new int[MAX_CLUSTERS];
boolean changed;

// Input number of points


System.out.print("Enter number of data points: ");
n = sc.nextInt();

// Input coordinates
System.out.println("Enter the coordinates (x y) for each point:");
for (i = 0; i < n; i++) {
x[i] = sc.nextDouble();
y[i] = sc.nextDouble();
cluster[i] = -1; // initialize cluster assignment
}

// Input number of clusters


System.out.print("Enter number of clusters (k): ");
k = sc.nextInt();
// Initialize first k points as centroids
for (i = 0; i < k; i++) {
centroidX[i] = x[i];
centroidY[i] = y[i];
}

// Repeat until convergence


for (iter = 0; iter < 100; iter++) {
changed = false;

// Step 1: Assign points to nearest centroid


for (i = 0; i < n; i++) {
double minDist = distance(x[i], y[i], centroidX[0], centroidY[0]);
int minCluster = 0;
for (j = 1; j < k; j++) {
double dist = distance(x[i], y[i], centroidX[j], centroidY[j]);
if (dist < minDist) {
minDist = dist;
minCluster = j;
}
}
if (cluster[i] != minCluster) {
cluster[i] = minCluster;
changed = true;
}
}

// Step 2: Update centroids


for (j = 0; j < k; j++) {
newCentroidX[j] = 0;
newCentroidY[j] = 0;
count[j] = 0;
}
for (i = 0; i < n; i++) {
newCentroidX[cluster[i]] += x[i];
newCentroidY[cluster[i]] += y[i];
count[cluster[i]]++;
}
for (j = 0; j < k; j++) {
if (count[j] > 0) {
centroidX[j] = newCentroidX[j] / count[j];
centroidY[j] = newCentroidY[j] / count[j];
}
}

// Print iteration result


System.out.println("\nIteration " + (iter + 1) + ":");
for (j = 0; j < k; j++) {
System.out.printf(" Centroid %d: (%.2f, %.2f)%n", j + 1,
centroidX[j], centroidY[j]);
}
for (i = 0; i < n; i++) {
System.out.printf(" Point (%.2f, %.2f) -> Cluster %d%n", x[i], y[i],
cluster[i] + 1);
}

// Stop if no point changed cluster


if (!changed) break;
}

// Final Output
System.out.println("\nFinal Clusters:");
for (j = 0; j < k; j++) {
System.out.print("Cluster " + (j + 1) + ": ");
for (i = 0; i < n; i++) {
if (cluster[i] == j) {
System.out.printf("(%.2f, %.2f) ", x[i], y[i]);
}
}
System.out.println();
}

sc.close();
}
}

INPUT & OUTPUT:
Conclusion:

R1 (3)   R2 (5)   R3 (4)   R4 (3)   Total (15)   Sign with Date
EXPERIMENT NO: 08

Aim: Implementation of any one Hierarchical Clustering method.

Software used: Java/Python


Theory:

A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the following steps:

1. Identify the two clusters that are closest together, and

2. Merge the two most similar clusters. These steps are continued until all the clusters are merged together.

In Hierarchical Clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called a Dendrogram (a tree-like diagram that records the sequences of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which clusters are merged (bottom-up view) or broken up (top-down view).

There are two types of hierarchical clustering methods:

1. Agglomerative hierarchical clustering:

This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic
clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain
termination conditions are satisfied.

2. Divisive hierarchical clustering:


This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold.

AGGLOMERATIVE HIERARCHICAL CLUSTERING: The figure shows the application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, to a data set of five objects (a, b, c, d, e).

• Initially, AGNES places each object into a cluster of its own.


• The clusters are then merged step-by-step according to some criterion.
Agglomerative Algorithm: (AGNES)

Given

-a set of N objects to be clustered

-an N*N distance matrix ,

The basic process of clustering is this:

Step1: Assign each object to a cluster so that for N objects we have N clusters each containing just one
Object.

Step2: Let the distances between the clusters be the same as the distances between the objects they
contain.

Step3: Find the most similar pair of clusters and merge them into a single cluster so that we now have one
cluster less.

Step4: Compute distances between the new cluster and each of the old clusters.

Step5: Repeat steps 3 and 4 until all items are clustered into a single cluster of size N.

• Step 4 can be done in different ways and this distinguishes single and complete linkage.

-> For complete-linkage algorithm:

o clustering process is terminated when the maximum distance between nearest clusters
exceeds an arbitrary threshold.

-> For single-linkage algorithm:

o clustering process is terminated when the minimum distance between nearest clusters
exceeds an arbitrary threshold.

o EXAMPLE:
Suppose this data is to be clustered.

• In this example, cutting the tree after the second row of the dendrogram will yield clusters {a} {b
c} {d e} {f}.

• Cutting the tree after the third row will yield clusters {a} {b c} {d e f}, which is a coarser
clustering, with a smaller number but larger clusters.

The hierarchical clustering dendrogram would be as such:

In our example, we have six elements {a} {b} {c} {d} {e} and {f}.

The first step is to determine which elements to merge in a cluster.

Usually, we take the two closest elements, according to the chosen distance.

Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances
updated. Suppose we have merged the two closest elements b and c, we now have the following clusters
{a}, {b, c}, {d}, {e} and {f}, and want to merge them further.

To do that, we need to take the distance between {a} and {b c}, and therefore define the distance
between two clusters. Usually the distance between two clusters A and B is one of the following:

• The maximum distance between elements of each cluster (also called complete-linkage clustering):
max {d(x,y):x∈A,y∈B}

• The minimum distance between elements of each cluster (also called single-linkage clustering):

min {d(x,y):x∈A,y∈B}

• The mean distance between elements of each cluster (also called average linkage clustering):

(1 / (|A|·|B|)) ∑x∈A ∑y∈B d(x,y)
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and
one can decide to stop clustering either when the clusters are too far apart to be merged (distance
criterion) or when there is a sufficiently small number of clusters (number criterion).
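
As a quick aside, the three linkage criteria above can be checked with a small, self-contained sketch; the function names and the sample clusters A and B below are illustrative assumptions, not part of the lab program.

import numpy as np

def single_linkage(A, B):
    # Minimum Euclidean distance over all pairs (a, b) with a in A and b in B.
    return min(np.linalg.norm(np.array(a) - np.array(b)) for a in A for b in B)

def complete_linkage(A, B):
    # Maximum Euclidean distance over all pairs (a, b) with a in A and b in B.
    return max(np.linalg.norm(np.array(a) - np.array(b)) for a in A for b in B)

def average_linkage(A, B):
    # Mean Euclidean distance over all |A|*|B| pairs.
    total = sum(np.linalg.norm(np.array(a) - np.array(b)) for a in A for b in B)
    return total / (len(A) * len(B))

A = [(0, 0), (0, 1)]
B = [(3, 0), (4, 1)]
print(single_linkage(A, B))    # 3.0
print(complete_linkage(A, B))  # about 4.12
print(average_linkage(A, B))   # about 3.57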

PROGRAM:
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram

def euclidean_distance(p1, p2):
    return np.sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)

n = int(input("Enter number of data points: "))

points = []
for i in range(n):
    x, y = map(float, input(f"Enter x y for point {i+1}: ").split())
    points.append([x, y])
points = np.array(points)

k = int(input("Enter desired number of clusters: "))

# AGNES initialisation: every point starts in its own cluster.
clusters = [[i] for i in range(n)]
cluster_ids = list(range(n))
next_cluster_id = n
merge_history = []   # rows of (left_id, right_id, distance, size) in scipy linkage format

while len(clusters) > 1:
    # Find the closest pair of clusters using single linkage (minimum pairwise distance).
    min_dist = float("inf")
    merge_i = merge_j = -1
    for i in range(len(clusters)):
        for j in range(i+1, len(clusters)):
            dist = min(
                euclidean_distance(points[p1], points[p2])
                for p1 in clusters[i] for p2 in clusters[j]
            )
            if dist < min_dist:
                min_dist = dist
                merge_i, merge_j = i, j
    # Merge the two closest clusters and record the merge.
    left_id = cluster_ids[merge_i]
    right_id = cluster_ids[merge_j]
    new_cluster = clusters[merge_i] + clusters[merge_j]
    new_size = len(new_cluster)
    merge_history.append((left_id, right_id, min_dist, new_size))
    if merge_i > merge_j:
        clusters.pop(merge_i); cluster_ids.pop(merge_i)
        clusters.pop(merge_j); cluster_ids.pop(merge_j)
    else:
        clusters.pop(merge_j); cluster_ids.pop(merge_j)
        clusters.pop(merge_i); cluster_ids.pop(merge_i)
    clusters.append(new_cluster)
    cluster_ids.append(next_cluster_id)
    next_cluster_id += 1

# Replay the merge history until exactly k clusters remain, to label the points.
cluster_map = {i: [i] for i in range(n)}
current_set = set(range(n))
next_id = n
labels = [0]*n
for left_id, right_id, dist, size in merge_history:
    cluster_map[next_id] = cluster_map[left_id] + cluster_map[right_id]
    current_set.remove(left_id)
    current_set.remove(right_id)
    current_set.add(next_id)
    if len(current_set) == k:
        break
    next_id += 1

sorted_clusters = sorted(list(current_set), key=lambda cid: min(cluster_map[cid]))
for label_idx, cid in enumerate(sorted_clusters, start=1):
    for pt in cluster_map[cid]:
        labels[pt] = label_idx

print("\nCluster assignments:")
for idx, lab in enumerate(labels, start=1):
    print(f"Point {idx} ({points[idx-1][0]}, {points[idx-1][1]}) -> Cluster {lab}")

# Dendrogram of the full merge history (scipy expects a float linkage matrix).
Z = np.array(merge_history)

plt.figure(figsize=(10, 6))
dendrogram(Z, labels=[f"P{i+1}" for i in range(n)], color_threshold=0)
plt.title("Dendrogram (single-linkage, manual calculation)")
plt.ylabel("Distance (Euclidean)")
plt.show()

# Scatter plot of the points coloured by the k-cluster labelling.
plt.figure(figsize=(6, 6))
colors = plt.cm.get_cmap('tab10')
for i in range(n):
    plt.scatter(points[i, 0], points[i, 1], color=colors((labels[i]-1) % 10), s=80,
                edgecolor='k')
    plt.text(points[i, 0]+0.02, points[i, 1]+0.02, f"P{i+1}", fontsize=9)
plt.title(f"Points colored by cluster (k={k})")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.show()

INPUT & OUTPUT:
Conclusion:

R1 R2 R3 R4 Total Sign with Date
(3) (5) (4) (3) (15)
EXPERIMENT NO: 09

Aim: Implementation of Association Rule Mining (Apriori algorithm)

Software used: Java/C/Python

Theory:

Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear in a data set frequently.
For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a
frequent itemset. Finding such frequent patterns plays an essential role in mining associations, correlations, and many
other interesting relationships among data.

Moreover, it helps in data classification, clustering, and other data mining tasks
as well. Thus, frequent pattern mining has become an important data mining task
and a focused theme in data mining research.

The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets
for Boolean association rules. The name of the algorithm is based on the fact that it uses prior knowledge
of frequent itemset properties, as we shall see below. Apriori employs an iterative approach known as a level-wise
search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning
the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The
resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so
on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.

To improve the efficiency of the level-wise generation of frequent itemsets, an important property
called the Apriori property, presented below, is used to reduce the search space. We will first describe
this property, and then show an example illustrating its use.

Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

A two-step process is followed, consisting of join and prune actions: the join step generates candidate
(k+1)-itemsets by joining frequent k-itemsets, and the prune step discards any candidate that has an
infrequent k-subset.
Advantages

• It is an easy-to-implement and easy-to-understand algorithm.

• It can be used on large itemsets.

Disadvantages

• Sometimes, it may need to find a large number of candidate rules which can be
computationally expensive.
• Calculating support is also expensive because it has to go through the entire database.

Consider the following example. Before beginning the process, let us set the support threshold to
50%, i.e., only those itemsets are significant whose support is at least 50%.

Example:

Step 1: Create a frequency table of all the items that occur in all the transactions. For our case:

Item Frequency (No. of transactions)

Onion(O) 4

Potato(P) 5

Burger(B) 4

Milk(M) 4

Beer(Be) 2
Step 2: We know that only those elements are significant for which the support is greater than
or equal to the threshold support. Here, the support threshold is 50%, hence only those items are
significant which occur in at least three transactions, and such items are Onion(O),
Potato(P), Burger(B), and Milk(M). Therefore, we are left with:

Item Frequency (No. of transactions)

Onion(O) 4

Potato(P) 5

Burger(B) 4

Milk(M) 4

The table above represents the single items that are purchased by the customers frequently.

Step 3: The next step is to make all the possible pairs of the significant items, keeping in mind that
the order doesn't matter, i.e., AB is the same as BA. To do this, take the first item and pair it with all the
others, giving OP, OB, OM. Similarly, take the second item and pair it with the items that follow it,
i.e., PB, PM. We skip the preceding item because PO (the same as OP) already
exists. So, all the pairs in our example are OP, OB, OM, PB, PM, and BM.
Step 4: We will now count the occurrences of each pair in all the transactions.

Itemset Frequency (No. of transactions)

OP 4

OB 3

OM 2

PB 4

PM 3

BM 2

Step 5: Again only those itemsets are significant which cross the support threshold, and
those are OP, OB, PB, and PM.

Step 6: Now let’s say we would like to look for a set of three items that are purchased together.
We will use the itemsets found in step 5 and create a set of 3 items.
To create a set of 3 items, another rule, called self-join, is required. It says that from the item pairs OP,
OB, PB and PM we look for two pairs with an identical first letter, and so we get:

OP and OB, which gives OPB

PB and PM, which gives PBM

Next, we find the frequency for these two itemsets.

Itemset Frequency (No. of transactions)

OPB 4

PBM 3

Applying the threshold rule again, we find that OPB is the only significant itemset.

Therefore, the set of 3 items that was purchased most frequently is OPB.

The example that we considered was a fairly simple one, and mining the frequent itemsets stopped at 3 items, but in
practice there are dozens of items and this process can continue to much larger itemsets. Suppose we got the
significant sets with 3 items as OPQ, OPR, OQR, OQS and PQR, and now we want to generate the set of 4 items.
For this, we look at the sets which have their first two letters in common, i.e.,

OPQ and OPR give OPQR

OQR and OQS give OQRS

In general, we have to look for sets which differ only in their last letter/item, as the short sketch below illustrates.
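
A minimal sketch of this self-join step, assuming each frequent itemset is kept as a sorted tuple of item names; the function name and the sample data are illustrative.

from itertools import combinations

def self_join(frequent_k, k):
    # Join two frequent k-itemsets that share their first k-1 items
    # (i.e., differ only in the last item) to form a (k+1)-item candidate.
    candidates = []
    for a, b in combinations(sorted(frequent_k), 2):
        if a[:k-1] == b[:k-1]:
            candidates.append(tuple(sorted(set(a) | set(b))))
    return candidates

L3 = [("O","P","Q"), ("O","P","R"), ("O","Q","R"), ("O","Q","S"), ("P","Q","R")]
print(self_join(L3, 3))   # [('O', 'P', 'Q', 'R'), ('O', 'Q', 'R', 'S')]
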
Applications:

• Market Basket Analysis

• Network Forensics analysis

• Analysis of diabetic databases

• Adverse drug reaction detection

• E-commerce

• Customer analysis

PROGRAM:
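
A minimal, interactive Apriori sketch in Python, following the level-wise candidate generation and the join/prune steps described above. The transactions, minimum support count and minimum confidence are read from the user; the helper names used here (such as support_count and self-join via combinations) are illustrative assumptions, not a fixed API.

from itertools import combinations

# Read the transaction database: each transaction is a space-separated list of item names.
t = int(input("Enter number of transactions: "))
transactions = []
for i in range(t):
    items = input(f"Enter items in transaction {i+1} (space-separated): ").split()
    transactions.append(set(items))

min_support = int(input("Enter minimum support count: "))

def support_count(itemset, transactions):
    # Number of transactions that contain every item of the candidate itemset.
    return sum(1 for trans in transactions if itemset.issubset(trans))

# L1: frequent 1-itemsets, found with one scan of the database.
items = sorted({item for trans in transactions for item in trans})
current_frequent = [frozenset([item]) for item in items
                    if support_count(frozenset([item]), transactions) >= min_support]

k = 1
all_frequent = {}
while current_frequent:
    all_frequent[k] = current_frequent
    print(f"\nFrequent {k}-itemsets:")
    for itemset in current_frequent:
        print(f"  {sorted(itemset)} : support count = {support_count(itemset, transactions)}")

    # Join step: merge frequent k-itemsets whose union has k+1 items.
    # Prune step (Apriori property): drop candidates with an infrequent k-subset.
    k += 1
    prev_frequent = set(current_frequent)
    candidates = set()
    for a, b in combinations(current_frequent, 2):
        union = a | b
        if len(union) == k and all(frozenset(sub) in prev_frequent
                                   for sub in combinations(union, k - 1)):
            candidates.add(union)

    # Keep only the candidates that meet the minimum support (one more database scan).
    current_frequent = [c for c in candidates
                        if support_count(c, transactions) >= min_support]

# Generate association rules A -> B from the frequent itemsets of size 2 or more.
min_conf = float(input("\nEnter minimum confidence (0-1, e.g., 0.6): "))
print("\nAssociation rules:")
for level, itemsets in all_frequent.items():
    if level < 2:
        continue
    for itemset in itemsets:
        full_support = support_count(itemset, transactions)
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                conf = full_support / support_count(antecedent, transactions)
                if conf >= min_conf:
                    print(f"  {sorted(antecedent)} -> {sorted(consequent)} (confidence = {conf:.2f})")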

INPUT & OUTPUT:

Conclusion:

R1 R2 R3 R4 Total Sign with Date
(3) (5) (4) (3) (15)
EXPERIMENT NO: 10

Aim: Implementation of the PageRank Algorithm.

Software used: Java/C/Python


Theory:

PageRank (PR) is an algorithm used by Google Search to rank websites in their


search engine results. PageRank was named after Larry Page, one of the founders of
Google. PageRank is a way of measuring the importance of website pages. According to
Google:

PageRank works by counting the number and quality of links to a page to determine a
rough estimate of how important the website is. The underlying assumption is that more
important websites are likely to receive more links from other websites.

It is not the only algorithm used by Google to order search engine results, but it is the first
algorithm that was used by the company, and it is the best known.

Algorithm:

The PageRank algorithm outputs a probability distribution used to represent the likelihood
that a person randomly clicking on links will arrive at any particular page. PageRank can be
calculated for collections of documents of any size. The PageRank computations require several
passes, called “iterations”, through the collection to adjust approximate PageRank values to
more closely reflect the theoretical true value.

Working:
Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or multiple
outbound links from one single page to another single page, are ignored. PageRank is initialized to the same
value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number
of pages on the web at that time, so each page in this example would have an initial value of 1. However, later
versions of PageRank, and the remainder of this section, assume a probability distribution between 0 and 1.
Hence the initial value for each page in this example is 0.25.

The PageRank transferred from a given page to the targets of its outbound links upon the
next iteration is divided equally among all outbound links.

If the only links in the system were from pages B, C, and D to A, each link would transfer
0.25 PageRank to A upon the next iteration, for a total of 0.75.

PR(A)=PR(B)+PR(C)+PR(D)

Suppose instead that page B had a link to pages C and A, page C had a link to page A, and
page D had links to all three pages. Thus, upon the first iteration, page B would transfer half of its
existing value, or 0.125, to page A and the other half, or 0.125, to page C. Page C would transfer all
of its existing value, 0.25, to the only page it links to, A. Since D had three outbound links, it would
transfer one third of its existing value, or approximately 0.083, to A. At the completion of this iteration,
page A will have a PageRank of approximately 0.458.

PR(A)=PR(B)/2+PR(C)/1+PR(D)/3

In other words, the PageRank conferred by an outbound link is equal to the document's own
PageRank score divided by the number of its outbound links L(v).

PR(A)=PR(B)/L(B)+PR(C)/L(C)+PR(D)/L(D)

In the general case, the PageRank value for any page u can be expressed as

PR(u) = Σ PR(v) / L(v), summed over all v ∈ Bu,

i.e., the PageRank value for a page u is dependent on the PageRank values for each
page v contained in the set Bu (the set containing all pages linking to page u), divided by the
number L(v) of links from page v. The algorithm also involves a damping factor d; with damping,
the update becomes PR(u) = (1 - d)/N + d · Σ PR(v)/L(v), which is the form used in the
program below.
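
The arithmetic of the four-page example above can be verified with a few lines of Python, using the simple, undamped update PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D); the variable names are illustrative.

# Initial values, as in the example: every page starts at 0.25.
PR = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
out_links = {"B": ["C", "A"], "C": ["A"], "D": ["A", "B", "C"]}

# One undamped iteration for page A: each in-link contributes PR(v) / L(v).
pr_A = sum(PR[v] / len(out_links[v]) for v in ("B", "C", "D"))
print(round(pr_A, 3))   # 0.458  (= 0.125 + 0.25 + 0.0833...)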

PROGRAM:
nodes = input("Enter node names separated by space (e.g., A B C D): ").split()
n = len(nodes)
node_index = {nodes[i]: i for i in range(n)}
adj_matrix = [[0]*n for _ in range(n)]

print("\nEnter outgoing links for each node (space-separated). Leave empty if no outgoing links.")
for node in nodes:
    links = input(f"Outgoing links from {node}: ").split()
    for link in links:
        if link not in node_index:
            print(f"Warning: {link} is not a valid node. Skipping.")
            continue
        adj_matrix[node_index[node]][node_index[link]] = 1

# Read the damping factor d (probability of following a link rather than jumping randomly).
while True:
    try:
        d = float(input("\nEnter damping factor (0-1, e.g., 0.85): "))
        if 0 < d < 1:
            break
        else:
            print("Please enter a number between 0 and 1.")
    except ValueError:
        print("Invalid input. Enter a decimal number between 0 and 1.")

epsilon = 0.0001                      # convergence tolerance
PR = [1/n]*n                          # initial rank: uniform probability distribution
out_degree = [sum(row) for row in adj_matrix]

print("\nPageRank Iterations:\n")
print("Iteration\t" + "\t".join(nodes))
iteration = 0
while True:
    iteration += 1
    new_PR = [0]*n
    for i in range(n):
        # Sum the rank contributed by every page j that links to page i.
        rank_sum = 0
        for j in range(n):
            if adj_matrix[j][i] == 1 and out_degree[j] != 0:
                rank_sum += PR[j] / out_degree[j]
        new_PR[i] = (1 - d)/n + d * rank_sum
    print(f"{iteration}\t\t" + "\t".join(f"{x:.4f}" for x in new_PR))
    # Stop once every rank changes by less than epsilon between iterations.
    if all(abs(new_PR[i] - PR[i]) < epsilon for i in range(n)):
        PR = new_PR
        break
    PR = new_PR

ranking = sorted([(nodes[i], PR[i]) for i in range(n)], key=lambda x: x[1],
                 reverse=True)
print("\nFinal Node Ranking:")
for node, _ in ranking:
    print(f"{node} -> ", end="")
print("END")
print("\nOrdered nodes by rank:", " -> ".join([node for node, _ in ranking]))

INPUT & OUTPUT:

Conclusion:

R1 R2 R3 R4 Total Sign with Date
(3) (5) (4) (3) (15)
