Data Mining
Assignment no. 1
Question no. 1:
Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and
Height (5 to 6 ft). 15 of these 30 students play cricket in their leisure time. Now, we want to create a model to
predict who will play cricket during leisure time. In this problem, we need to segregate the students who
play cricket in their leisure time based on the most significant of the three input variables.
Answer:
1. Entropy
First, we need to find the entropy of the parent node. For a node where a proportion p of the students play cricket,
Entropy = -p log2(p) - (1 - p) log2(1 - p)
Entropy for Parent Node = -(15/30) log2(15/30) - (15/30) log2(15/30) = 1
The entropy is 1, which shows that it is a completely impure node.
For Split on gender:
Entropy for Female node = -(2/10) log2(2/10) - (8/10) log2(8/10) = 0.72
Entropy for Male node = -(13/20) log2(13/20) - (7/20) log2(7/20) = 0.93
Entropy for split on Gender = (10/30)*0.72 + (20/30)*0.93 = 0.86
Information Gain for split on Gender = 1 - 0.86 = 0.14
For Split on Class:
Entropy for Class IX node = -(6/14) log2(6/14) - (8/14) log2(8/14) = 0.99
Entropy for Class X node = -(9/16) log2(9/16) - (7/16) log2(7/16) = 0.99
Entropy for split on Class = (14/30)*0.99 + (16/30)*0.99 = 0.99
Information Gain for split on Class = 1 - 0.99 = 0.01
Above, you can see that the Information Gain for the split on Gender (0.14) is higher than for the split on Class (0.01), so the tree will split on Gender.
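These figures can be verified with a short Python snippet; this is a minimal sketch that just re-does the arithmetic above using the student counts from the question:
import math

def entropy(p, n):
    # Entropy of a node with p cricket players out of n students
    if p == 0 or p == n:
        return 0.0
    q = p / n
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

parent = entropy(15, 30)  # 1.0, the impure parent node
# Gender split: 2 of 10 girls and 13 of 20 boys play cricket
gain_gender = parent - ((10/30) * entropy(2, 10) + (20/30) * entropy(13, 20))
# Class split: 6 of 14 in IX and 9 of 16 in X play cricket
gain_class = parent - ((14/30) * entropy(6, 14) + (16/30) * entropy(9, 16))
print("IG(Gender) =", round(gain_gender, 2))  # 0.14
print("IG(Class) =", round(gain_class, 2))    # 0.01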
2. Gini
Secondly, we have to find the Gini score for the given data.
Gini works with a categorical target variable such as “Success” or “Failure”; here it is computed as the sum of the squared class probabilities, so a higher score means a purer node.
For Split on gender:
Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
Weighted Gini for split on Gender = (10/30)*0.68 + (20/30)*0.55 = 0.59
For Split on Class:
Gini for sub-node Class IX = (0.43)*(0.43) + (0.57)*(0.57) = 0.51
Gini for sub-node Class X = (0.56)*(0.56) + (0.44)*(0.44) = 0.51
Weighted Gini for split on Class = (14/30)*0.51 + (16/30)*0.51 = 0.51
Above, you can see that the Gini score for the split on Gender (0.59) is higher than for the split on Class (0.51); hence, the node split will take place on Gender.
Now, to express this as an impurity measure, mathematically we can say,
Gini Impurity = 1 - Gini
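As with entropy, the Gini scores can be checked with a small helper; again a sketch using only the counts given above:
def gini(p, n):
    # Gini score (sum of squared class probabilities); higher means purer
    q = p / n
    return q * q + (1 - q) * (1 - q)

gini_gender = (10/30) * gini(2, 10) + (20/30) * gini(13, 20)
gini_class = (14/30) * gini(6, 14) + (16/30) * gini(9, 16)
print("Gini(Gender) =", round(gini_gender, 2))  # 0.59
print("Gini(Class) =", round(gini_class, 2))    # 0.51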
Question no. 2:
Draw a decision tree using pen and paper and then share it via email. Do it for the given dataset:
– Create a decision tree for the dataset at the following link
Answer:
As it was a big dataset and it was too hard to draw that tree on paper, I decided to write a program
that uses different Python libraries to build a decision tree on the given “Diabetes” dataset.
# -*- coding: utf-8 -*-
"""
Created on Mon Oct 28 19:47:33 2019
@author: sgfghh
"""
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
col_names = ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age', 'diabetes']
# load dataset
pima = pd.read_csv("C:/Users/sgfghh/Downloads/data/data/diabetes_data.csv", header=None, names=col_names)
print(pima.head())
feature_cols = ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age']
#feature_cols = ['triceps', 'insulin', 'bmi', 'dpf', 'age']
X = pima[feature_cols] # Features
y = pima.diabetes # Target variable
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
# Create Decision Tree classifer object
clf = DecisionTreeClassifier()
# Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
from sklearn.tree import export_graphviz
from io import StringIO # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
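Note: pydotplus only wraps Graphviz's dot tool, so rendering the PNG also requires the Graphviz software itself to be installed on the machine.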
OUTPUT: the decision tree rendered by the script above (saved as diabetes.png).
Question no. 3:
Further, implement it in Python. Apply a decision tree, logistic regression and nearest neighbor to the
dataset. Then compare the accuracy and the confusion matrix.
Code in Python:
"""
Spyder Editor
This is a temporary script file.
"""
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#importing our diabetes dataset
dataset = pd.read_csv('C:/Users/sgfghh/Downloads/data/data/diabetes_data.csv')
X = dataset.iloc[:, 0:8].values # the eight feature columns (pregnancies through age)
Y = dataset.iloc[:, 8].values # target column: the diabetes outcome
#print(dataset.head())
#print("Cancer data set dimensions : {}".format(dataset.shape))
#print(dataset.isnull().sum())
#print(dataset.isna().sum())
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
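#Note: scaling mainly matters for the nearest neighbor model, whose distance
#computations would otherwise be dominated by features with large ranges;
#tree-based models are insensitive to monotone feature scaling.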
#Using Logistic Regression Algorithm to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, Y_train)
#Using KNeighborsClassifier Method of neighbors class to use Nearest Neighbor algorithm
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, Y_train)
#Using SVC method of svm class to use Support Vector Machine Algorithm
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, Y_train)
#Using SVC method of svm class to use Kernel SVM Algorithm
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, Y_train)
#Using GaussianNB method of naive_bayes class to use Naive Bayes Algorithm
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, Y_train)
#Using DecisionTreeClassifier of tree class to use Decision Tree Algorithm
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)
#Using RandomForestClassifier method of ensemble class to use Random Forest Classification algorithm
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test) # predictions from the last model fitted above (the Random Forest)
print("Predictions of the last model:", Y_pred)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
print("Confusion matrix", cm)
Output:
Predictions of the last model: [1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0
 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0
 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 1 0]
Confusion matrix [[117 14]
[ 31 30]]
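From this confusion matrix, the accuracy of the Random Forest works out to (117 + 30) / 192 ≈ 0.77, i.e. about 77% of the test samples are classified correctly.

Note that the script above keeps rebinding the single classifier variable, so only the last model fitted (the Random Forest) is actually scored. To compare the decision tree, logistic regression and nearest neighbor models as the question asks, each one has to be evaluated separately. A minimal sketch of that comparison, reusing the train/test split and the classes already imported above:
#Scoring each of the three requested models separately
from sklearn.metrics import accuracy_score, confusion_matrix
models = {
    "Logistic Regression": LogisticRegression(random_state = 0),
    "Nearest Neighbor": KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2),
    "Decision Tree": DecisionTreeClassifier(criterion = 'entropy', random_state = 0),
}
for name, model in models.items():
    model.fit(X_train, Y_train)
    pred = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(Y_test, pred))
    print(confusion_matrix(Y_test, pred))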