
S. No   EXPERIMENTS                                                                 DATE        SIGNATURE

1. For a given set of training data examples stored in a .CSV file, implement and demonstrate the Candidate Elimination algorithm to output a description of the set of all hypotheses consistent with the training examples.

2. Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate data set for building the decision tree and apply this knowledge to classify a new sample.

3. Build an Artificial Neural Network by implementing the Backpropagation algorithm and test the same using appropriate data sets.

4. Write a program to implement the naïve Bayesian classifier for a sample training data set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.

5. Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy, precision, and recall for your data set.

6. Write a program to construct a Bayesian network considering medical data. Use this model to demonstrate the diagnosis of heart patients using the standard Heart Disease Data Set. You can use Java/Python ML library classes/API.

7. Apply the EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set for clustering using the kMeans algorithm. Compare the results of these two algorithms and comment on the quality of clustering. You can add Java/Python ML library classes/API in the program.
Program 1
AIM: For a given set of training data examples stored in a .CSV file, implement and demonstrate
the Candidate Elimination algorithm to output a description of the set of all hypotheses consistent
with the training examples.

ALGORITHM:
Step 1: Load the data set.
Step 2: Initialize the General Hypothesis and the Specific Hypothesis.
Step 3: For each training example:
Step 4: If the example is positive:
            if attribute_value == hypothesis_value:
                do nothing
            else:
                replace the attribute value with '?' (i.e., generalize it)
Step 5: If the example is negative:
            make the general hypothesis more specific.
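As a quick illustration of these two cases (a hedged toy sketch with made-up attribute values, not part of the prescribed program): one positive example generalizes the specific boundary S, while one negative example specializes the general boundary G.

# Toy illustration only: hypothetical 3-attribute examples
S = ['Sunny', 'Warm', 'Normal']            # specific boundary after the first positive example
pos = ['Sunny', 'Warm', 'High']            # a second positive example
S = [s if s == v else '?' for s, v in zip(S, pos)]
print(S)                                   # ['Sunny', 'Warm', '?']

G = [['?', '?', '?'] for _ in range(3)]    # most general boundary
neg = ['Rainy', 'Warm', 'High']            # a negative example
for i in range(3):
    # specialize only on attributes where S disagrees with the negative example
    G[i][i] = S[i] if S[i] != neg[i] and S[i] != '?' else '?'
print(G)                                   # [['Sunny', '?', '?'], ['?', '?', '?'], ['?', '?', '?']]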

TRAINING DATA SET:


FLOWCHART:
PROGRAM CODE:

import numpy as np
import pandas as pd

data = pd.DataFrame(data=pd.read_csv('finds1.csv'))
concepts = np.array(data.iloc[:, 0:-1])
target = np.array(data.iloc[:, -1])

def learn(concepts, target):
    specific_h = concepts[0].copy()
    print("initialization of specific_h and general_h")
    print(specific_h)
    general_h = [["?" for i in range(len(specific_h))] for i in range(len(specific_h))]
    print(general_h)
    for i, h in enumerate(concepts):
        if target[i] == "Yes":
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    specific_h[x] = '?'
                    general_h[x][x] = '?'
        if target[i] == "No":
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    general_h[x][x] = specific_h[x]
                else:
                    general_h[x][x] = '?'
        print("steps of Candidate Elimination Algorithm", i + 1)
        print("Specific_h ", i + 1, "\n")
        print(specific_h)
        print("general_h ", i + 1, "\n")
        print(general_h)
    indices = [i for i, val in enumerate(general_h) if val == ['?', '?', '?', '?', '?', '?']]
    for i in indices:
        general_h.remove(['?', '?', '?', '?', '?', '?'])
    return specific_h, general_h

s_final, g_final = learn(concepts, target)
print("Final Specific_h:", s_final, sep="\n")
print("Final General_h:", g_final, sep="\n")

OUTPUT:
initialization of specific_h and general_h
['Cloudy' 'Cold' 'High' 'Strong' 'Warm'
'Change']
[['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?',
'?', '?', '?'], ['?', '?', '?', '?', '?', '?']]
steps of Candidate Elimination Algorithm 8
Specific_h 8
['?' '?' '?' 'Strong' '?' '?']
general_h 8
[['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', 'Strong', '?',
'?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?']]
Final Specific_h:
['?' '?' '?' 'Strong' '?' '?']
Final General_h:
[['?', '?', '?', 'Strong', '?', '?']]
Program 2

AIM: Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an appropriate data set
for building the decision tree and apply this knowledge to classify a new sample.

ALGORITHM:

• Create a Root node for the tree.
• If all Examples are positive, return the single-node tree Root with label = +.
• If all Examples are negative, return the single-node tree Root with label = -.
• If Attributes is empty, return the single-node tree Root with label = the most common value of Target_attribute in Examples.
• Otherwise begin:
  o A ← the attribute from Attributes that best* classifies Examples (* "best" means highest information gain; a worked example follows the training data set below).
  o The decision attribute for Root ← A.
  o For each possible value vi of A:
    ▪ Add a new tree branch below Root, corresponding to the test A = vi.
    ▪ Let Examples_vi be the subset of Examples that have value vi for A.
    ▪ If Examples_vi is empty,
      then below this new branch add a leaf node with label = the most common value of Target_attribute in Examples;
      else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A}).
• End.
• Return Root.
FLOWCHART:
TRAINING DATA SET:

Day Outlook Temperature Humidity Wind PlayTennis

D1 Sunny Hot High Weak No

D2 Sunny Hot High Strong No

D3 Overcast Hot High Weak Yes

D4 Rain Mild High Weak Yes

D5 Rain Cool Normal Weak Yes

D6 Rain Cool Normal Strong No

D7 Overcast Cool Normal Strong Yes

D8 Sunny Mild High Weak No

D9 Sunny Cool Normal Weak Yes

D10 Rain Mild Normal Weak Yes

D11 Sunny Mild Normal Strong Yes

D12 Overcast Mild High Strong Yes

D13 Overcast Hot Normal Weak Yes

D14 Rain Mild High Strong No
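
The "best" attribute in the algorithm above is the one with the highest information gain. As a minimal hedged check against this data set (illustrative only, not part of the prescribed program): the full set has 9 "Yes" and 5 "No" examples, and Outlook splits it into Sunny (2+, 3-), Overcast (4+, 0-) and Rain (3+, 2-).

import math

def entropy(p, n):
    # entropy of a set with p positive and n negative examples
    if p == 0 or n == 0:
        return 0.0
    pp, pn = p / (p + n), n / (p + n)
    return -(pp * math.log2(pp) + pn * math.log2(pn))

E = entropy(9, 5)                                                              # about 0.940
gain_outlook = E - (5/14)*entropy(2, 3) - (4/14)*entropy(4, 0) - (5/14)*entropy(3, 2)
print(round(E, 3), round(gain_outlook, 3))                                     # 0.94 0.247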

TEST DATASET:

Day Outlook Temperature Humidity Wind

T1 Rain Cool Normal Strong

T2 Sunny Mild Normal Strong


PROGRAM CODE:

import pandas as pd
import math
import numpy as np

data = pd.read_csv("Dataset/4-dataset.csv")
features = [feat for feat in data]
features.remove("answer")

class Node:
    def __init__(self):
        self.children = []
        self.value = ""
        self.isLeaf = False
        self.pred = ""

def entropy(examples):
    pos = 0.0
    neg = 0.0
    for _, row in examples.iterrows():
        if row["answer"] == "yes":
            pos += 1
        else:
            neg += 1
    if pos == 0.0 or neg == 0.0:
        return 0.0
    else:
        p = pos / (pos + neg)
        n = neg / (pos + neg)
        return -(p * math.log(p, 2) + n * math.log(n, 2))

def info_gain(examples, attr):
    uniq = np.unique(examples[attr])
    gain = entropy(examples)
    for u in uniq:
        subdata = examples[examples[attr] == u]
        sub_e = entropy(subdata)
        gain -= (float(len(subdata)) / float(len(examples))) * sub_e
    return gain

def ID3(examples, attrs):
    root = Node()
    max_gain = 0
    max_feat = ""
    for feature in attrs:
        gain = info_gain(examples, feature)
        if gain > max_gain:
            max_gain = gain
            max_feat = feature
    root.value = max_feat
    #print ("\nMax feature attr", max_feat)
    uniq = np.unique(examples[max_feat])
    for u in uniq:
        subdata = examples[examples[max_feat] == u]
        if entropy(subdata) == 0.0:
            newNode = Node()
            newNode.isLeaf = True
            newNode.value = u
            newNode.pred = np.unique(subdata["answer"])
            root.children.append(newNode)
        else:
            dummyNode = Node()
            dummyNode.value = u
            new_attrs = attrs.copy()
            new_attrs.remove(max_feat)
            child = ID3(subdata, new_attrs)
            dummyNode.children.append(child)
            root.children.append(dummyNode)
    return root

def printTree(root: Node, depth=0):
    for i in range(depth):
        print("\t", end="")
    print(root.value, end="")
    if root.isLeaf:
        print(" -> ", root.pred)
    print()
    for child in root.children:
        printTree(child, depth + 1)

def classify(root: Node, new):
    for child in root.children:
        if child.value == new[root.value]:
            if child.isLeaf:
                print("Predicted Label for new example", new, " is:", child.pred)
                return
            else:
                classify(child.children[0], new)

root = ID3(data, features)
print("Decision Tree is:")
printTree(root)
print(" ")

new = {"outlook": "sunny", "temperature": "hot", "humidity": "normal", "wind": "strong"}
classify(root, new)

OUTPUT:

The decision tree for the dataset using ID3 algorithm is

Outlook
Rain
Wind
Strong
No
Weak
Yes

Overcast
Yes

Sunny
Humidity
Normal
Yes
High
No

The test instance: ['rain', 'cool', 'normal', 'strong']
The label for test instance: no

The test instance: ['sunny', 'mild', 'normal', 'strong']
The label for test instance: yes
Program 3

AIM: Build an Artificial Neural Network by implementing the Backpropagation algorithm and test the same
using appropriate data sets.

ALGORITHM:

1. Create a feed-forward network with ni inputs, nhidden hidden units, and nout output units.

2. Initialize all network weights to small random numbers.

3. Until the termination condition is met, do:

   For each training example (x, t), do:

   Propagate the input forward through the network:
   1. Input the instance x to the network and compute the output ou of every unit u in the network.

   Propagate the errors backward through the network:
   2. For each output unit k, calculate its error term δk = ok(1 - ok)(tk - ok).
   3. For each hidden unit h, calculate its error term δh = oh(1 - oh) Σk wkh δk, summed over the output units k.
   4. Update each network weight: wji ← wji + η δj xji, where η is the learning rate.
FLOWCHART:

TRAINING DATA SET:

Example  Sleep  Study  Expected % in Exams
1        2      9      92
2        1      5      86
3        3      6      89

NORMALIZE THE INPUT:

Example  Sleep             Study             Expected % in Exams
1        2/3 = 0.66666667  9/9 = 1           0.92
2        1/3 = 0.33333333  5/9 = 0.55555556  0.86
3        3/3 = 1           6/9 = 0.66666667  0.89
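
The normalized values above come from dividing each input column by its column maximum (3 for Sleep, 9 for Study) and the expected percentage by 100, which is exactly what the program below does. A minimal hedged check using the same arrays:

import numpy as np

X = np.array([[2, 9], [1, 5], [3, 6]], dtype=float)
y = np.array([[92], [86], [89]], dtype=float)
print(X / np.amax(X, axis=0))   # divide each column by its column maximum
print(y / 100)                  # scale expected percentages to [0, 1]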

PROGRAM CODE:

import numpy as np

X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X/np.amax(X, axis=0)  # maximum of X array longitudinally
y = y/100

# Sigmoid Function
def sigmoid(x):
    return 1/(1 + np.exp(-x))

# Derivative of Sigmoid Function
def derivatives_sigmoid(x):
    return x * (1 - x)

# Variable initialization
epoch = 5   # Setting training iterations
lr = 0.1    # Setting learning rate

inputlayer_neurons = 2    # number of features in data set
hiddenlayer_neurons = 3   # number of hidden layer neurons
output_neurons = 1        # number of neurons at output layer

# weight and bias initialization
# draws a random range of numbers uniformly of dim x*y
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

for i in range(epoch):
    # Forward Propagation
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)

    # Backpropagation
    EO = y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)  # how much hidden layer wts contributed to error
    d_hiddenlayer = EH * hiddengrad

    wout += hlayer_act.T.dot(d_output) * lr  # dot product of next-layer error and current-layer output
    wh += X.T.dot(d_hiddenlayer) * lr

    print("-----------Epoch-", i+1, "Starts--------------")
    print("Input: \n" + str(X))
    print("Actual Output: \n" + str(y))
    print("Predicted Output: \n", output)
    print("-----------Epoch-", i+1, "Ends--------------\n")

print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)

OUTPUT:

———–Epoch- 1 Starts———-
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.81951208]
[0.8007242 ]
[0.82485744]]
———–Epoch- 1 Ends———-

———–Epoch- 2 Starts———-
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.82033938]
[0.80153634]
[0.82568134]]
———–Epoch- 2 Ends———-

———–Epoch- 3 Starts———-
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.82115226]
[0.80233463]
[0.82649072]]
———–Epoch- 3 Ends———-

———–Epoch- 4 Starts———-
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.82195108]
[0.80311943]
[0.82728598]]
———–Epoch- 4 Ends———-

———–Epoch- 5 Starts———-
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.8227362 ]
[0.80389106]
[0.82806747]]
———–Epoch- 5 Ends———-

Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.8227362 ]
[0.80389106]
[0.82806747]]
Program 4

AIM: Write a program to implement the naïve Bayesian classifier for a sample training data set stored as
a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.

ALGORITHM:
STEP 1: Load the training data set from the CSV file into a list of dictionaries, where each dictionary represents a single instance (row) in the data set, the keys represent the attribute names (columns), and the values represent the corresponding attribute values for that instance.

STEP 2: Determine the class variable for each instance in the training data set and add it as a new key-value pair to the corresponding dictionary.

STEP 3: Create a dictionary to store the prior probabilities for each class variable in the training data set. The key-value pairs should be of the form {class_variable: prior_probability}.

STEP 4: For each attribute in the training data set, create a dictionary to store the conditional probabilities for each attribute value given each class variable. The key-value pairs should be of the form {(attribute, attribute_value, class_variable): conditional_probability}.

STEP 5: Compute the prior probabilities for each class variable by counting the number of instances in the training data set that belong to each class variable and dividing by the total number of instances.

STEP 6: For each attribute in the training data set, compute the conditional probabilities for each attribute value given each class variable by counting the number of instances in the training data set that have that attribute value and belong to each class variable, and dividing by the number of instances that belong to that class variable.

STEP 7: Load the test data sets from CSV files into lists of dictionaries, following the same format as the training data set.

STEP 8: For each instance in each test data set, compute the posterior probability for each class variable given the attribute values in that instance, using the naive Bayesian formula: P(class_variable | attribute_values) = P(class_variable) * product(P(attribute_value | class_variable) for attribute_value in attribute_values).

STEP 9: Determine the predicted class variable for each instance in each test data set as the class variable with the highest posterior probability.

STEP 10: Compare the predicted class variables to the actual class variables in each test data set to compute the accuracy of the classifier.

STEP 11: Output the accuracy for each test data set.
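
Note that the data set used below has numeric attributes, so the program estimates P(attribute_value | class) with a Gaussian (normal) density computed from each attribute's per-class mean and standard deviation, rather than the frequency counts of Steps 4-6. A minimal hedged sketch of that density (the same idea as the calculateprobability function in the program):

import math

def gaussian_probability(x, mean, stdev):
    # normal density N(x; mean, stdev), used as P(attribute_value | class) for numeric attributes
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

# illustrative call with made-up numbers: likelihood of x = 71.5 under N(mean=73, stdev=6.2)
print(gaussian_probability(71.5, 73.0, 6.2))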

FLOWCHART:
TRAINING DATA SET:

Examples  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI   DiabetesPedigreeFunction  Age  Outcome

1         6            148      72             35             0        33.6  0.627                     50   1
2         1            85       66             29             0        26.6  0.351                     31   0
3         8            183      64             0              0        23.3  0.672                     32   1
4         1            89       66             23             94       28.1  0.167                     21   0
5         0            137      40             35             168      43.1  2.288                     33   1
6         5            116      74             0              0        25.6  0.201                     30   0
7         3            78       50             32             88       31    0.248                     26   1
8         10           115      0              0              0        35.3  0.134                     29   0
9         2            197      70             45             543      30.5  0.158                     53   1
10        8            125      96             0              0        0     0.232                     54   1

PROGRAM CODE:

import csv
import random
import math

def loadcsv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    for i in range(len(dataset)):
        # converting strings into numbers for processing
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

def splitdataset(dataset, splitratio):
    # 67% training size
    trainsize = int(len(dataset) * splitratio)
    trainset = []
    copy = list(dataset)
    while len(trainset) < trainsize:
        # generate indices for the dataset list randomly to pick elements for training data
        index = random.randrange(len(copy))
        trainset.append(copy.pop(index))
    return [trainset, copy]

def separatebyclass(dataset):
    # creates a dictionary of classes 1 and 0 where the values are
    # the instances belonging to each class
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def mean(numbers):
    return sum(numbers)/float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg, 2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

def summarize(dataset):
    # creates a list of (mean, stdev) tuples, one per attribute
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]  # excluding labels +ve or -ve
    return summaries

def summarizebyclass(dataset):
    separated = separatebyclass(dataset)
    #print(separated)
    summaries = {}
    for classvalue, instances in separated.items():
        # summaries is a dict of (mean, std) tuples for each class value
        summaries[classvalue] = summarize(instances)
    return summaries

def calculateprobability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x-mean, 2)/(2*math.pow(stdev, 2))))
    return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent

def calculateclassprobabilities(summaries, inputvector):
    # probabilities contains the probability of every class for the test instance
    probabilities = {}
    for classvalue, classsummaries in summaries.items():  # class and attribute information as mean and sd
        probabilities[classvalue] = 1
        for i in range(len(classsummaries)):
            mean, stdev = classsummaries[i]  # mean and sd of every attribute for class 0 and 1 separately
            x = inputvector[i]               # test vector's attribute value
            probabilities[classvalue] *= calculateprobability(x, mean, stdev)  # use normal dist
    return probabilities

def predict(summaries, inputvector):
    # assigns the class that has the highest probability
    probabilities = calculateclassprobabilities(summaries, inputvector)
    bestLabel, bestProb = None, -1
    for classvalue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classvalue
    return bestLabel

def getpredictions(summaries, testset):
    predictions = []
    for i in range(len(testset)):
        result = predict(summaries, testset[i])
        predictions.append(result)
    return predictions

def getaccuracy(testset, predictions):
    correct = 0
    for i in range(len(testset)):
        if testset[i][-1] == predictions[i]:
            correct += 1
    return (correct/float(len(testset))) * 100.0

def main():
    filename = 'naivedata.csv'
    splitratio = 0.67
    dataset = loadcsv(filename)

    trainingset, testset = splitdataset(dataset, splitratio)
    print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingset), len(testset)))
    # prepare model
    summaries = summarizebyclass(trainingset)
    #print(summaries)
    # test model
    predictions = getpredictions(summaries, testset)  # predictions of test data using the trained summaries
    accuracy = getaccuracy(testset, predictions)
    print('Accuracy of the classifier is : {0}%'.format(accuracy))

main()

OUTPUT:

Split 768 rows into train=514 and test=254 rows
Accuracy of the classifier is: 71.65354330708661%


Program 5

AIM: Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to
perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy, precision,
and recall for your data set.

ALGORITHM:

LEARN_NAIVE_BAYES_TEXT (Examples, V)

Examples is a set of text documents along with their target values. V is the set of all possible target values. This function learns the probability terms P(wk | vj), describing the probability that a randomly drawn word from a document in class vj will be the English word wk. It also learns the class prior probabilities P(vj).

1. Collect all words, punctuation, and other tokens that occur in Examples:
   Vocabulary ← the set of all distinct words and other tokens occurring in any text document from Examples

2. Calculate the required P(vj) and P(wk | vj) probability terms:

   For each target value vj in V do
     o docsj ← the subset of documents from Examples for which the target value is vj
     o P(vj) ← |docsj| / |Examples|
     o Textj ← a single document created by concatenating all members of docsj
     o n ← total number of distinct word positions in Textj
     o for each word wk in Vocabulary
       ▪ nk ← number of times word wk occurs in Textj
       ▪ P(wk | vj) ← (nk + 1) / (n + |Vocabulary|)
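
As a quick hedged illustration of the smoothed estimate in the last step (made-up counts, not taken from the data set below): if a word wk occurs nk = 3 times in the concatenated class-vj text of n = 40 word positions, and the vocabulary has 25 distinct words, then

# illustrative only: Laplace-smoothed estimate P(wk | vj) = (nk + 1) / (n + |Vocabulary|)
n, nk, vocabulary = 40, 3, 25
print((nk + 1) / (n + vocabulary))   # 4/65, about 0.0615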

FLOWCHART:
TRAINING DATA SET:

Text Documents Label


1 I love this sandwich pos
2 This is an amazing place pos
3 I feel very good about these beers pos
4 This is my best work pos
5 What an awesome view pos
6 I do not like this restaurant neg
7 I am tired of this stuff neg
8 I can’t deal with this neg
9 He is my sworn enemy neg
10 My boss is horrible neg
11 This is an awesome place pos
12 I do not like the taste of this juice neg
13 I love to dance pos
14 I am sick and tired of this place neg
15 What a great holiday pos
16 That is a bad locality to stay neg
17 We will have good fun tomorrow pos
18 I went to my enemy’s house today neg

PROGRAM CODE:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

msg = pd.read_csv('naivetext.csv', names=['message', 'label'])
print('The dimensions of the dataset', msg.shape)

msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})
X = msg.message
y = msg.labelnum

# splitting the dataset into train and test data
xtrain, xtest, ytrain, ytest = train_test_split(X, y)
print('\n the total number of Training Data :', ytrain.shape)
print('\n the total number of Test Data :', ytest.shape)

# output the words or tokens in the text documents
cv = CountVectorizer()
xtrain_dtm = cv.fit_transform(xtrain)
xtest_dtm = cv.transform(xtest)
print('\n The words or Tokens in the text documents \n')
print(cv.get_feature_names())
df = pd.DataFrame(xtrain_dtm.toarray(), columns=cv.get_feature_names())

# Training Naive Bayes (NB) classifier on training data.
clf = MultinomialNB().fit(xtrain_dtm, ytrain)
predicted = clf.predict(xtest_dtm)

# printing accuracy, confusion matrix, precision and recall
print('\n Accuracy of the classifier is', metrics.accuracy_score(ytest, predicted))
print('\n Confusion matrix')
print(metrics.confusion_matrix(ytest, predicted))
print('\n The value of Precision', metrics.precision_score(ytest, predicted))
print('\n The value of Recall', metrics.recall_score(ytest, predicted))

OUTPUT:

The dimensions of the dataset (18, 2)


1. I love this sandwich
2. This is an amazing place
3. I feel very good about these beers
4. This is my best work
5. What an awesome view
6. I do not like this restaurant
7. I am tired of this stuff
8. I can’t deal with this
9. He is my sworn enemy
10. My boss is horrible
11. This is an awesome place
12. I do not like the taste of this juice
13. I love to dance
14. I am sick and tired of this place
15. What a great holiday
16. That is a bad locality to stay
17. We will have good fun tomorrow
18. I went to my enemy’s house today

Name: message, dtype: object

0 1

1 1
2 1
3 1
4 1
5 0
6 0
7 0
8 0
9 0
10 1
11 0
12 1
13 0
14 1
15 0
16 1
17 0

Name: labelnum, dtype: int64

The total number of Training Data: (13,)
The total number of Test Data: (5,)

The words or Tokens in the text documents


[‘about’, ‘am’, ‘amazing’, ‘an’, ‘and’, ‘awesome’, ‘beers’, ‘best’, ‘can’, ‘deal’, ‘do’, ‘enemy’, ‘feel’,

‘fun’, ‘good’, ‘great’, ‘have’, ‘he’, ‘holiday’, ‘house’, ‘is’, ‘like’, ‘love’, ‘my’, ‘not’, ‘of’, ‘place’,

‘restaurant’, ‘sandwich’, ‘sick’, ‘sworn’, ‘these’, ‘this’, ‘tired’, ‘to’, ‘today’, ‘tomorrow’, ‘very’, ‘view’, ‘we’,
‘went’, ‘what’, ‘will’, ‘with’, ‘work’]

Accuracy of the classifier is 0.8

Confusion matrix

[[2 1]

[0 2]]

The value of Precision 0.6666666666666666
The value of Recall 1.0
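
For reference, these figures follow directly from the confusion matrix printed above (rows = actual, columns = predicted, for labels 0 and 1), so [[2 1] [0 2]] means TN=2, FP=1, FN=0, TP=2. A quick check:

TN, FP, FN, TP = 2, 1, 0, 2
print((TP + TN) / (TP + TN + FP + FN))   # accuracy  = 0.8
print(TP / (TP + FP))                    # precision = 0.666...
print(TP / (TP + FN))                    # recall    = 1.0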


Program 6

AIM: Write a program to construct a Bayesian network considering medical data. Use this model to
demonstrate the diagnosis of heart patients using standard Heart Disease Data Set. You can use Java/Python
ML library classes/API.

ALGORITHM:

• Collect medical data: Gather information on patients including demographics, symptoms, diagnostic tests, and outcomes.

• Preprocess data: Handle missing values, encode categorical variables, and scale numerical features as needed to prepare for analysis.

• Define nodes: Identify random variables such as age, sex, symptoms, tests, and diagnosis as nodes in the Bayesian network.

• Determine edges: Establish conditional dependencies between nodes, reflecting causal relationships and influences among variables.

• Learn CPDs: Utilize methods like Maximum Likelihood Estimation (MLE) or Bayesian estimation to learn the Conditional Probability Distributions (CPDs) for each node from the data.

• Build DAG: Construct a Directed Acyclic Graph (DAG) representing the Bayesian network structure, incorporating the learned CPDs.

• Perform inference: Use techniques like Variable Elimination to perform inference on new patient data, calculating posterior probabilities.

• Diagnose: Determine the most probable states given the observed evidence to diagnose the patient's condition.

• Validate performance: Assess the network's performance on validation data, evaluating metrics like precision, recall, and F1-score.

• Refine network (optional): Iterate on the network structure and parameters based on additional data or domain expertise to improve diagnostic capabilities and generalization.
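
As a small hedged illustration of the "Learn CPDs" step (hypothetical counts, not taken from the heart data): with Maximum Likelihood Estimation a discrete CPD is simply a conditional frequency table, for example P(heartdisease | exang):

# toy illustration only: (exang, heartdisease) -> number of patients observed
counts = {(0, 0): 30, (0, 1): 10,
          (1, 0): 5,  (1, 1): 15}
for exang in (0, 1):
    total = counts[(exang, 0)] + counts[(exang, 1)]
    print(exang, counts[(exang, 1)] / total)   # P(heartdisease=1 | exang) = 0.25 and 0.75
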
FLOWCHART:

TRAINING DATA SET:

age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  heartdisease

63   1    1   145       233   1    2        150      0      2.3      3      0   6     0
67   1    4   160       286   0    2        108      1      1.5      2      3   3     2
67   1    4   120       229   0    2        129      1      2.6      2      2   7     1
41   0    2   130       204   0    2        172      0      1.4      1      0   3     0
62   0    4   140       268   0    2        160      0      3.6      3      2   3     3
60   1    4   130       206   0    2        132      1      2.4      2      2   7     4


PROGRAM CODE:

import numpy as np
import pandas as pd
import csv
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.models import BayesianModel
from pgmpy.inference import VariableElimination

heartDisease = pd.read_csv('heart.csv')
heartDisease = heartDisease.replace('?', np.nan)
print('Sample instances from the dataset are given below')
print(heartDisease.head())
print('\n Attributes and datatypes')
print(heartDisease.dtypes)

model = BayesianModel([('age', 'heartdisease'), ('sex', 'heartdisease'), ('exang', 'heartdisease'),
                       ('cp', 'heartdisease'), ('heartdisease', 'restecg'), ('heartdisease', 'chol')])
print('\nLearning CPD using Maximum likelihood estimators')
model.fit(heartDisease, estimator=MaximumLikelihoodEstimator)

print('\n Inferencing with Bayesian Network:')
HeartDiseasetest_infer = VariableElimination(model)

print('\n 1. Probability of HeartDisease given evidence= restecg')
q1 = HeartDiseasetest_infer.query(variables=['heartdisease'], evidence={'restecg': 1})
print(q1)

print('\n 2. Probability of HeartDisease given evidence= cp ')
q2 = HeartDiseasetest_infer.query(variables=['heartdisease'], evidence={'cp': 2})
print(q2)

OUTPUT:
Program 7

AIM: Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set for clustering
using kMeans algorithm. Compare the results of these two algorithms and comment on the quality of
clustering. You can add Java/Python ML library classes/API in the program.

ALGORITHM:

1. Choose k data points as the initial centroids (cluster centers).

2. Repeat steps 3-6:

3. For each data point x ∈ D do:

4. Compute the distance from x to each centroid.

5. Assign x to the closest centroid (a centroid represents a cluster).

6. Re-compute the centroids using the current cluster memberships, until the stopping criterion is met.

The EM Algorithm for Gaussian Mixtures

1. (Estimation, E): Calculate the expected value E[zij] of each hidden variable zij, assuming that the current hypothesis h (the current mixture parameters, e.g. the component means) holds.

2. (Maximization, M): Calculate a new maximum-likelihood hypothesis h', assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in step 1. Then replace the hypothesis h by the new hypothesis h' and iterate.
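
As a brief hedged illustration of one E-step (a toy one-dimensional mixture with made-up means, unit variances and equal priors, not the Iris data used below):

import math

def normal(x, mu, sigma=1.0):
    # Gaussian density N(x; mu, sigma)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

x = 1.0
mu = [0.0, 2.0]                        # current hypothesis: the two component means
num = [normal(x, m) for m in mu]
resp = [v / sum(num) for v in num]     # E-step: E[zij], the responsibility of each component for x
print(resp)                            # [0.5, 0.5] since x = 1.0 is equidistant from both means
# M-step (for the means): mu_j <- sum_i E[zij] * x_i / sum_i E[zij], over all points x_i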

FLOWCHART:
TRAINING DATA SET:

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species


1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
7 4.6 3.4 1.4 0.3 Iris-setosa
8 5.0 3.4 1.5 0.2 Iris-setosa
9 4.4 2.9 1.4 0.2 Iris-setosa
10 4.9 3.1 1.5 0.1 Iris-setosa
11 5.4 3.7 1.5 0.2 Iris-setosa
12 4.8 3.4 1.6 0.2 Iris-setosa
13 4.8 3.0 1.4 0.1 Iris-setosa
14 4.3 3.0 1.1 0.1 Iris-setosa
15 5.8 4.0 1.2 0.2 Iris-setosa
16 5.7 4.4 1.5 0.4 Iris-setosa
17 5.4 3.9 1.3 0.4 Iris-setosa
18 5.1 3.5 1.4 0.3 Iris-setosa
19 5.7 3.8 1.7 0.3 Iris-setosa
20 5.1 3.8 1.5 0.3 Iris-setosa
21 5.4 3.4 1.7 0.2 Iris-setosa
22 5.1 3.7 1.5 0.4 Iris-setosa
23 4.6 3.6 1.0 0.2 Iris-setosa
24 5.1 3.3 1.7 0.5 Iris-setosa
25 4.8 3.4 1.9 0.2 Iris-setosa
26 5.0 3.0 1.6 0.2 Iris-setosa
27 5.0 3.4 1.6 0.4 Iris-setosa
28 5.2 3.5 1.5 0.2 Iris-setosa
29 5.2 3.4 1.4 0.2 Iris-setosa
30 4.7 3.2 1.6 0.2 Iris-setosa
31 4.8 3.1 1.6 0.2 Iris-setosa
32 5.4 3.4 1.5 0.4 Iris-setosa
33 5.2 4.1 1.5 0.1 Iris-setosa
34 5.5 4.2 1.4 0.2 Iris-setosa
35 4.9 3.1 1.5 0.1 Iris-setosa
36 5.0 3.2 1.2 0.2 Iris-setosa
37 5.5 3.5 1.3 0.2 Iris-setosa
38 4.9 3.1 1.5 0.1 Iris-setosa
39 4.4 3.0 1.3 0.2 Iris-setosa
40 5.1 3.4 1.5 0.2 Iris-setosa
41 5.0 3.5 1.3 0.3 Iris-setosa
42 4.5 2.3 1.3 0.3 Iris-setosa
43 4.4 3.2 1.3 0.2 Iris-setosa
44 5.0 3.5 1.6 0.6 Iris-setosa
45 5.1 3.8 1.9 0.4 Iris-setosa
46 4.8 3.0 1.4 0.3 Iris-setosa
47 5.1 3.8 1.6 0.2 Iris-setosa
48 4.6 3.2 1.4 0.2 Iris-setosa
49 5.3 3.7 1.5 0.2 Iris-setosa
50 5.0 3.3 1.4 0.2 Iris-setosa
51 7.0 3.2 4.7 1.4 Iris-versicolor
52 6.4 3.2 4.5 1.5 Iris-versicolor
53 6.9 3.1 4.9 1.5 Iris-versicolor
54 5.5 2.3 4.0 1.3 Iris-versicolor
55 6.5 2.8 4.6 1.5 Iris-versicolor
56 5.7 2.8 4.5 1.3 Iris-versicolor
57 6.3 3.3 4.7 1.6 Iris-versicolor
58 4.9 2.4 3.3 1.0 Iris-versicolor
59 6.6 2.9 4.6 1.3 Iris-versicolor
60 5.2 2.7 3.9 1.4 Iris-versicolor
61 5.0 2.0 3.5 1.0 Iris-versicolor
62 5.9 3.0 4.2 1.5 Iris-versicolor
63 6.0 2.2 4.0 1.0 Iris-versicolor
64 6.1 2.9 4.7 1.4 Iris-versicolor
65 5.6 2.9 3.6 1.3 Iris-versicolor
66 6.7 3.1 4.4 1.4 Iris-versicolor
67 5.6 3.0 4.5 1.5 Iris-versicolor
68 5.8 2.7 4.1 1.0 Iris-versicolor
69 6.2 2.2 4.5 1.5 Iris-versicolor
70 5.6 2.5 3.9 1.1 Iris-versicolor
71 5.9 3.2 4.8 1.8 Iris-versicolor
72 6.1 2.8 4.0 1.3 Iris-versicolor
73 6.3 2.5 4.9 1.5 Iris-versicolor
74 6.1 2.8 4.7 1.2 Iris-versicolor
75 6.4 2.9 4.3 1.3 Iris-versicolor
76 6.6 3.0 4.4 1.4 Iris-versicolor
77 6.8 2.8 4.8 1.4 Iris-versicolor
78 6.7 3.0 5.0 1.7 Iris-versicolor
79 6.0 2.9 4.5 1.5 Iris-versicolor
80 5.7 2.6 3.5 1.0 Iris-versicolor
81 5.5 2.4 3.8 1.1 Iris-versicolor
82 5.5 2.4 3.7 1.0 Iris-versicolor
83 5.8 2.7 3.9 1.2 Iris-versicolor
84 6.0 2.7 5.1 1.6 Iris-versicolor
85 5.4 3.0 4.5 1.5 Iris-versicolor
86 6.0 3.4 4.5 1.6 Iris-versicolor
87 6.7 3.1 4.7 1.5 Iris-versicolor
88 6.3 2.3 4.4 1.3 Iris-versicolor
89 5.6 3.0 4.1 1.3 Iris-versicolor
90 5.5 2.5 4.0 1.3 Iris-versicolor
91 5.5 2.6 4.4 1.2 Iris-versicolor
92 6.1 3.0 4.6 1.4 Iris-versicolor
93 5.8 2.6 4.0 1.2 Iris-versicolor
94 5.0 2.3 3.3 1.0 Iris-versicolor
95 5.6 2.7 4.2 1.3 Iris-versicolor
96 5.7 3.0 4.2 1.2 Iris-versicolor
97 5.7 2.9 4.2 1.3 Iris-versicolor
98 6.2 2.9 4.3 1.3 Iris-versicolor
99 5.1 2.5 3.0 1.1 Iris-versicolor
100 5.7 2.8 4.1 1.3 Iris-versicolor
101 6.3 3.3 6.0 2.5 Iris-virginica
102 5.8 2.7 5.1 1.9 Iris-virginica
103 7.1 3.0 5.9 2.1 Iris-virginica
104 6.3 2.9 5.6 1.8 Iris-virginica
105 6.5 3.0 5.8 2.2 Iris-virginica
106 7.6 3.0 6.6 2.1 Iris-virginica
107 4.9 2.5 4.5 1.7 Iris-virginica
108 7.3 2.9 6.3 1.8 Iris-virginica
109 6.7 2.5 5.8 1.8 Iris-virginica
110 7.2 3.6 6.1 2.5 Iris-virginica
111 6.5 3.2 5.1 2.0 Iris-virginica
112 6.4 2.7 5.3 1.9 Iris-virginica
113 6.8 3.0 5.5 2.1 Iris-virginica
114 5.7 2.5 5.0 2.0 Iris-virginica
115 5.8 2.8 5.1 2.4 Iris-virginica
116 6.4 3.2 5.3 2.3 Iris-virginica
117 6.5 3.0 5.5 1.8 Iris-virginica
118 7.7 3.8 6.7 2.2 Iris-virginica
119 7.7 2.6 6.9 2.3 Iris-virginica
120 6.0 2.2 5.0 1.5 Iris-virginica
121 6.9 3.2 5.7 2.3 Iris-virginica
122 5.6 2.8 4.9 2.0 Iris-virginica
123 7.7 2.8 6.7 2.0 Iris-virginica
124 6.3 2.7 4.9 1.8 Iris-virginica
125 6.7 3.3 5.7 2.1 Iris-virginica
126 7.2 3.2 6.0 1.8 Iris-virginica
127 6.2 2.8 4.8 1.8 Iris-virginica
128 6.1 3.0 4.9 1.8 Iris-virginica
129 6.4 2.8 5.6 2.1 Iris-virginica
130 7.2 3.0 5.8 1.6 Iris-virginica
131 7.4 2.8 6.1 1.9 Iris-virginica
132 7.9 3.8 6.4 2.0 Iris-virginica
133 6.4 2.8 5.6 2.2 Iris-virginica
134 6.3 2.8 5.1 1.5 Iris-virginica
135 6.1 2.6 5.6 1.4 Iris-virginica
136 7.7 3.0 6.1 2.3 Iris-virginica
137 6.3 3.4 5.6 2.4 Iris-virginica
138 6.4 3.1 5.5 1.8 Iris-virginica
139 6.0 3.0 4.8 1.8 Iris-virginica
140 6.9 3.1 5.4 2.1 Iris-virginica
141 6.7 3.1 5.6 2.4 Iris-virginica
142 6.9 3.1 5.1 2.3 Iris-virginica
143 5.8 2.7 5.1 1.9 Iris-virginica
144 6.8 3.2 5.9 2.3 Iris-virginica
145 6.7 3.3 5.7 2.5 Iris-virginica
146 6.7 3.0 5.2 2.3 Iris-virginica
147 6.3 2.5 5.0 1.9 Iris-virginica
148 6.5 3.0 5.2 2.0 Iris-virginica
149 6.2 3.4 5.4 2.3 Iris-virginica
150 5.9 3.0 5.1 1.8 Iris-virginica

PROGRAM CODE:

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
import sklearn.metrics as metrics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Class']
dataset = pd.read_csv("8-dataset.csv", names=names)

X = dataset.iloc[:, :-1]
label = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
y = [label[c] for c in dataset.iloc[:, -1]]

plt.figure(figsize=(14, 7))
colormap = np.array(['red', 'lime', 'black'])

# REAL PLOT
plt.subplot(1, 3, 1)
plt.title('Real')
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y])

# K-MEANS PLOT
model = KMeans(n_clusters=3, random_state=0).fit(X)
plt.subplot(1, 3, 2)
plt.title('KMeans')
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[model.labels_])
print('The accuracy score of K-Mean: ', metrics.accuracy_score(y, model.labels_))
print('The Confusion matrix of K-Mean:\n', metrics.confusion_matrix(y, model.labels_))

# GMM PLOT
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
y_cluster_gmm = gmm.predict(X)
plt.subplot(1, 3, 3)
plt.title('GMM Classification')
plt.scatter(X.Petal_Length, X.Petal_Width, c=colormap[y_cluster_gmm])
print('The accuracy score of EM: ', metrics.accuracy_score(y, y_cluster_gmm))
print('The Confusion matrix of EM:\n ', metrics.confusion_matrix(y, y_cluster_gmm))

OUTPUT:

The accuracy score of K-Mean: 0.24
The Confusion matrix of K-Mean:
[[ 0 50  0]
 [48  0  2]
 [14  0 36]]
The accuracy score of EM: 0.36666666666666664
The Confusion matrix of EM:
[[50  0  0]
 [ 0  5 45]
 [ 0 50  0]]
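
When commenting on clustering quality, note that accuracy_score here depends on how the arbitrary cluster ids produced by KMeans and GaussianMixture happen to line up with the 0/1/2 class encoding, so a low score does not necessarily mean a poor clustering. A permutation-invariant measure such as the adjusted Rand index gives a fairer comparison; a small self-contained sketch of the difference:

from sklearn.metrics import accuracy_score, adjusted_rand_score

true_labels      = [0, 0, 1, 1, 2, 2]
permuted_cluster = [1, 1, 2, 2, 0, 0]   # the same grouping, but with different cluster ids
print(accuracy_score(true_labels, permuted_cluster))        # 0.0 despite a perfect grouping
print(adjusted_rand_score(true_labels, permuted_cluster))   # 1.0, invariant to relabelling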
