Data Mining
Assignment no. 1
Question no. 1:
Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and
Height (5 to 6 ft). 15 of these 30 students play cricket in their leisure time. Now, we want to create a model to
predict who will play cricket during leisure time. In this problem, we need to segregate the students who
play cricket in their leisure time based on the most significant of the three input variables.
Answer:
1. Entropy
First, we need to find the entropy of the parent node. For a node where a proportion p of the students play cricket,
Entropy = -p log2(p) - (1 - p) log2(1 - p)
Entropy for Parent Node = -(15/30) log2(15/30) - (15/30) log2(15/30) = 1
The entropy is 1, which shows that it is a completely impure node.
For Split on gender:
Entropy for Female node = -(2/10) log2(2/10) - (8/10) log2(8/10) = 0.72
Entropy for Male node = -(13/20) log2(13/20) - (7/20) log2(7/20) = 0.93
Entropy for split on Gender = (10/30)*0.72 + (20/30)*0.93 = 0.86
Information Gain for split on Gender = 1 - 0.86 = 0.14
For Split on Class:
Entropy for Class IX node = -(6/14) log2(6/14) - (8/14) log2(8/14) = 0.99
Entropy for Class X node = -(9/16) log2(9/16) - (7/16) log2(7/16) = 0.99
Entropy for split on Class = (14/30)*0.99 + (16/30)*0.99 = 0.99
Information Gain for split on Class = 1 - 0.99 = 0.01
Above, you can see that the Information Gain for the split on Gender (0.14) is higher than for the split on Class (0.01), so the tree will split on Gender.
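These figures can be verified with a short Python snippet; this is a minimal sketch that just re-does the arithmetic above using the student counts from the question:
import math

def entropy(p, n):
    # Entropy of a node with p cricket players out of n students
    if p == 0 or p == n:
        return 0.0
    q = p / n
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

parent = entropy(15, 30)  # 1.0, the impure parent node
# Gender split: 2 of 10 girls and 13 of 20 boys play cricket
gain_gender = parent - ((10/30) * entropy(2, 10) + (20/30) * entropy(13, 20))
# Class split: 6 of 14 in IX and 9 of 16 in X play cricket
gain_class = parent - ((14/30) * entropy(6, 14) + (16/30) * entropy(9, 16))
print("IG(Gender) =", round(gain_gender, 2))  # 0.14
print("IG(Class) =", round(gain_class, 2))    # 0.01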
2. Gini
Secondly, we have to find the Gini score for the given data.
Gini works with a categorical target variable such as “Success” or “Failure”; here it is computed as the sum of the squared class probabilities, so a higher score means a purer node.
For Split on gender:
Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
Weighted Gini for split on Gender = (10/30)*0.68 + (20/30)*0.55 = 0.59
For Split on Class:
Gini for sub-node Class IX = (0.43)*(0.43) + (0.57)*(0.57) = 0.51
Gini for sub-node Class X = (0.56)*(0.56) + (0.44)*(0.44) = 0.51
Weighted Gini for split on Class = (14/30)*0.51 + (16/30)*0.51 = 0.51
Above, you can see that the Gini score for the split on Gender (0.59) is higher than for the split on Class (0.51); hence, the node split will take place on Gender.
Now, to express this as an impurity measure, mathematically we can say,
Gini Impurity = 1 - Gini
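As with entropy, the Gini scores can be checked with a small helper; again a sketch using only the counts given above:
def gini(p, n):
    # Gini score (sum of squared class probabilities); higher means purer
    q = p / n
    return q * q + (1 - q) * (1 - q)

gini_gender = (10/30) * gini(2, 10) + (20/30) * gini(13, 20)
gini_class = (14/30) * gini(6, 14) + (16/30) * gini(9, 16)
print("Gini(Gender) =", round(gini_gender, 2))  # 0.59
print("Gini(Class) =", round(gini_class, 2))    # 0.51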
Question no. 2:
Draw a decision tree using pen and paper and then share it via email. Do it for the given dataset:
– Create a decision tree for the dataset at the following link
Answer:
As it was a big dataset and it was too hard to draw that tree on paper, I decided to write a program
that uses different Python libraries to build a decision tree on the given “Diabetes” dataset.
# -*- coding: utf-8 -*-
"""
Created on Mon Oct 28 19:47:33 2019
@author: sgfghh
"""
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
col_names = ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age', 'diabetes']
# load dataset
pima = pd.read_csv("C:/Users/sgfghh/Downloads/data/data/diabetes_data.csv", header=None, names=col_names)
print(pima.head())
feature_cols = ['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi', 'dpf', 'age']
#feature_cols = ['triceps', 'insulin', 'bmi', 'dpf', 'age']
X = pima[feature_cols] # Features
y = pima.diabetes # Target variable
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
# Create Decision Tree classifer object
clf = DecisionTreeClassifier()
# Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
from sklearn.tree import export_graphviz
from io import StringIO # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
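Note: pydotplus only wraps Graphviz's dot tool, so rendering the PNG also requires the Graphviz software itself to be installed on the machine.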
OUTPUT: the decision tree rendered by the script above (saved as diabetes.png).
Question no. 3:
Further, implement it in Python. Apply a decision tree, logistic regression and nearest neighbor to the
dataset. Then compare the accuracy and the confusion matrix.
Code in Python:
"""
Spyder Editor
This is a temporary script file.
"""
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#importing our diabetes dataset
dataset = pd.read_csv('C:/Users/sgfghh/Downloads/data/data/diabetes_data.csv')
X = dataset.iloc[:, 0:8].values # the eight feature columns (pregnancies through age)
Y = dataset.iloc[:, 8].values # target column: the diabetes outcome
#print(dataset.head())
#print("Cancer data set dimensions : {}".format(dataset.shape))
#print(dataset.isnull().sum())
#print(dataset.isna().sum())
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
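#Note: scaling mainly matters for the nearest neighbor model, whose distance
#computations would otherwise be dominated by features with large ranges;
#tree-based models are insensitive to monotone feature scaling.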
#Using Logistic Regression Algorithm to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, Y_train)
#Using KNeighborsClassifier Method of neighbors class to use Nearest Neighbor algorithm
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, Y_train)
#Using SVC method of svm class to use Support Vector Machine Algorithm
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, Y_train)
#Using SVC method of svm class to use Kernel SVM Algorithm
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, Y_train)
#Using GaussianNB method of naive_bayes class to use Naive Bayes Algorithm
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, Y_train)
#Using DecisionTreeClassifier of tree class to use Decision Tree Algorithm
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)
#Using RandomForestClassifier method of ensemble class to use Random Forest Classification algorithm
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test) # predictions from the last model fitted above (the Random Forest)
print("Predictions of the last model:", Y_pred)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
print("Confusion matrix", cm)
Output:
Predictions of the last model: [1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0
 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0
 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 1 0]
Confusion matrix [[117 14]
[ 31 30]]
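From this confusion matrix, the accuracy of the Random Forest works out to (117 + 30) / 192 ≈ 0.77, i.e. about 77% of the test samples are classified correctly.

Note that the script above keeps rebinding the single classifier variable, so only the last model fitted (the Random Forest) is actually scored. To compare the decision tree, logistic regression and nearest neighbor models as the question asks, each one has to be evaluated separately. A minimal sketch of that comparison, reusing the train/test split and the classes already imported above:
#Scoring each of the three requested models separately
from sklearn.metrics import accuracy_score, confusion_matrix
models = {
    "Logistic Regression": LogisticRegression(random_state = 0),
    "Nearest Neighbor": KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2),
    "Decision Tree": DecisionTreeClassifier(criterion = 'entropy', random_state = 0),
}
for name, model in models.items():
    model.fit(X_train, Y_train)
    pred = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(Y_test, pred))
    print(confusion_matrix(Y_test, pred))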